Too many errors on non-breaking space characters in source code #106101

Closed
dtolnay opened this issue Dec 23, 2022 · 1 comment · Fixed by #106566 or #106872
Labels
A-diagnostics Area: Messages for errors, warnings, and lints A-parser Area: The parsing of Rust source code to an AST C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

dtolnay commented Dec 23, 2022

Rustc emits a separate error for every single time U+00A0 appears in the source file.

(See #106098 for how someone might very reasonably end up with non-breaking space characters in their source code. Even if that issue gets resolved in rustdoc, I still think rustc's parser needs to handle this better, because non-breaking spaces might be copied in from some other website, or from documentation rendered by older versions of rustdoc.)

My preferred behavior would be for rustc to emit just a single error on the first non-breaking space in the entire file, then silently interpret every subsequent non-breaking space in the file as an ordinary space.

If that is too tricky, a more conservative improvement would be to emit a single error for each consecutive sequence of non-breaking space characters (i.e. this would typically result in one error per line, instead of one error per space).
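
To make the difference between these options concrete, here is a minimal sketch of the three reporting policies as a toy scanner. This is not rustc's lexer: `NbspPolicy` and `nbsp_error_positions` are hypothetical names invented for illustration, and the scanner assumes every U+00A0 is outside string literals and comments, which a real lexer would have to distinguish.

```rust
/// Hypothetical reporting policies for U+00A0 found in source text.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum NbspPolicy {
    /// One error per character (the behavior reported in this issue).
    PerChar,
    /// One error per contiguous run of U+00A0 (the conservative option).
    PerRun,
    /// One error for the first occurrence in the file; every later
    /// occurrence is silently treated as an ordinary space.
    FirstOnly,
}

const NBSP: char = '\u{a0}';

/// Returns the 1-based (line, column) positions at which an error would
/// be reported. Columns count characters, not bytes.
/// Assumption: no U+00A0 appears inside a string literal or comment.
fn nbsp_error_positions(src: &str, policy: NbspPolicy) -> Vec<(usize, usize)> {
    let mut errors = Vec::new();
    let (mut line, mut col) = (1, 1);
    let mut prev_was_nbsp = false;

    for ch in src.chars() {
        if ch == NBSP {
            let report = match policy {
                NbspPolicy::PerChar => true,
                // Only the first character of a run is reported.
                NbspPolicy::PerRun => !prev_was_nbsp,
                // Only the first occurrence in the whole input is reported.
                NbspPolicy::FirstOnly => errors.is_empty(),
            };
            if report {
                errors.push((line, col));
            }
        }
        prev_was_nbsp = ch == NBSP;

        if ch == '\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    errors
}
```

For an input with four leading non-breaking spaces on one line, `PerChar` reports four positions (columns 1 through 4), while `PerRun` and `FirstOnly` each report only line 1, column 1. The last two policies diverge once a second run appears on a later line: `PerRun` reports it, `FirstOnly` does not.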

Repro:

$ echo -e '\u00a0\u00a0\u00a0\u00a0fn main() {}' | rustc /dev/stdin -o a.out
error: unknown start of token: \u{a0}
 --> /dev/stdin:1:1
  |
1 |     fn main() {}
  | ^
  |
help: Unicode character ' ' (No-Break Space) looks like ' ' (Space), but it is not
  |
1 |     fn main() {}
  | +

error: unknown start of token: \u{a0}
 --> /dev/stdin:1:2
  |
1 |     fn main() {}
  |  ^
  |
help: Unicode character ' ' (No-Break Space) looks like ' ' (Space), but it is not
  |
1 |     fn main() {}
  |  +

error: unknown start of token: \u{a0}
 --> /dev/stdin:1:3
  |
1 |     fn main() {}
  |   ^
  |
help: Unicode character ' ' (No-Break Space) looks like ' ' (Space), but it is not
  |
1 |     fn main() {}
  |   +

error: unknown start of token: \u{a0}
 --> /dev/stdin:1:4
  |
1 |     fn main() {}
  |    ^
  |
help: Unicode character ' ' (No-Break Space) looks like ' ' (Space), but it is not
  |
1 |     fn main() {}
  |    +

error: aborting due to 4 previous errors
@bors bors closed this as completed in 8e0eecd Jan 14, 2023
@dtolnay dtolnay reopened this Jan 14, 2023

dtolnay commented Jan 14, 2023

Reopening because I think the fix in #106566 was too conservative. I think subsequent contiguous nbsp sequences in the file can safely be treated as whitespace.
