
arxiv:2408.02442

Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models

Published on Aug 5, 2024
Authors: Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen
Abstract

LLMs exhibit reduced reasoning abilities when required to generate content in structured formats compared to free-form responses.

AI-generated summary

Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world applications to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation space impact LLMs' abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs' performance when restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a significant decline in LLMs' reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead to greater performance degradation in reasoning tasks.

Community

We recently added gpt-4o-mini-2024-07-18 results using OpenAI's latest Structured Outputs API in our v3 edition.

Hi! This is Will from the .txt team. Given our focus on structured generation, we took the claims in this paper quite seriously and investigated what produced these surprising results, which contradicted our own past experiments. Here is our response: Say What You Mean: A Response to 'Let Me Speak Freely'.

As a tl;dr, here are the main points we raise (points 2 and 3 are particularly concerning):

  1. The paper itself finds that structured generation has superior performance on a number of classification tasks.
  2. The prompts used for unstructured (NL) generation are markedly different from the ones used for structured generation, so the comparisons are not apples-to-apples to begin with.
  3. The structured generation prompts do not provide the model with adequate information to solve the task; this leads to particularly poor performance for the ‘json-mode’ examples.
  4. The real meat of the paper is actually about parsing the results of one LLM with a second LLM. The authors refer to this as the “Perfect Text Parser”.
  5. The paper conflates structured generation with JSON-mode, even though independent runs of these evals show that “JSON-mode” yields better results than unstructured generation.

I think this simple result says it all: yes, prompts matter, but 0-shot CoT is already enough to prove the point.

| Last Letter Llama 3 Instruct | 0-shot CoT | 1-shot CoT (used in blog) | .txt reported best: JSON (struct) 1-shot CoT |
|------------------------------|------------|---------------------------|----------------------------------------------|
| lastletter-t3-f3             | **78.00*** | 57.33*                    | 77.00 (T4-F1)                                |
| Average of 9 prompts         | 70.07*     | 44.64*                    | -                                            |
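For context, the Last Letter task asks the model to concatenate the last letter of each word in a name or phrase, so the gold answer and an exact-match scorer fit in a few lines (a sketch assuming case-insensitive exact-match grading, which may differ from the repo's grader):

```python
def last_letter_target(phrase: str) -> str:
    """Gold answer for the Last Letter task: concatenate each word's last letter."""
    return "".join(word[-1] for word in phrase.split())

def score(prediction: str, phrase: str) -> bool:
    """Exact-match grading, ignoring case and surrounding whitespace."""
    return prediction.strip().lower() == last_letter_target(phrase).lower()
```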

For more details, see this updated note: https://github.com/appier-research/structure-gen/blob/main/updates.md

We included JSON structured generation with averaged results from different prompts, and it's still worse.

