HTML sanitization: Difference between revisions

Latest revision as of 10:05, 7 December 2023, by Frap; compared with the revision as of 10:14, 12 March 2022, by Widefox.
{{Short description|Process of removing undesirable parts of an HTML document}}
{{More citations needed|date=December 2009}}
In [[data sanitization]], '''HTML sanitization''' is the process of examining an [[HTML]] document and producing a new HTML document that preserves only the tags and attributes designated "safe" and desired. HTML sanitization can be used to protect against attacks such as [[cross-site scripting]] (XSS) by sanitizing any HTML code submitted by a user.

== Details ==
Basic text-formatting tags such as <code>&lt;b&gt;</code>, <code>&lt;i&gt;</code>, <code>&lt;u&gt;</code>, <code>&lt;em&gt;</code>, and <code>&lt;strong&gt;</code> are often allowed, while more advanced tags such as <code>&lt;script&gt;</code>, <code>&lt;object&gt;</code>, <code>&lt;embed&gt;</code>, and <code>&lt;link&gt;</code> are removed by the sanitization process. Potentially dangerous [[HTML attribute|attributes]] such as <code>onclick</code> are also removed to prevent malicious code from being injected.

Sanitization is typically performed using either a [[whitelist]] or a [[Blacklist (computing)|blacklist]] approach. Leaving a safe HTML element off a whitelist is not serious; it simply means that the feature will not be available after sanitization. If an unsafe element is left off a blacklist, however, the vulnerability will not be sanitized out of the HTML output. An out-of-date blacklist can therefore be dangerous if new, unsafe features have been introduced to the HTML Standard.
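
The whitelist approach described above can be illustrated with a minimal sketch built on Python's standard <code>html.parser</code> module. The names <code>ALLOWED_TAGS</code>, <code>DROP_CONTENT_TAGS</code>, <code>AllowlistSanitizer</code>, and <code>sanitize</code> are hypothetical and chosen for this example only, as is the particular set of permitted tags. The sketch drops every attribute, does not balance unmatched tags, and is not a substitute for a maintained sanitization library.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist for illustration; a real deployment would choose its own.
ALLOWED_TAGS = {"b", "i", "u", "em", "strong"}
# Elements whose text content is discarded along with the tags themselves.
DROP_CONTENT_TAGS = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags, with all attributes removed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []          # fragments of the sanitized document
        self.skip_depth = 0    # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            # Attributes such as onclick are never copied to the output.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text is re-escaped so it cannot be reinterpreted as markup.
            self.out.append(escape(data))


def sanitize(html_text):
    """Return a new document containing only whitelisted tags and text."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    dirty = '<b onclick="evil()">Hi</b><script>alert(1)</script> <i>there</i>'
    print(sanitize(dirty))
```

Because anything not named in <code>ALLOWED_TAGS</code> is dropped by default, an element added to HTML after the list was written is removed rather than passed through, which is the property that makes a whitelist fail safely where an out-of-date blacklist does not.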
In [[Java (programming language)|Java]] (and [[.NET Framework|.NET]]), sanitization can be achieved by using the [[OWASP]] Java HTML Sanitizer Project.<ref>{{Cite web|url=https://www.owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Project|title = OWASP Java HTML Sanitizer}}</ref>

In [[.NET Framework|.NET]], a number of sanitizers use the Html Agility Pack, an HTML parser.<ref>{{Cite web |url=http://htmlagilitypack.codeplex.com/ |title=HTML Agility Pack - Home |access-date=2013-01-04 |archive-date=2013-01-01 |archive-url=https://web.archive.org/web/20130101170916/http://htmlagilitypack.codeplex.com/ |url-status=dead }}</ref><ref>{{Cite web|url=http://eksith.wordpress.com/2011/06/14/whitelist-santize-htmlagilitypack/|title = Whitelist santize with HtmlAgilityPack|date = 14 June 2011}}</ref><ref name="HtmlRuleSanitizer" /> Another library is HtmlSanitizer.<ref>{{cite web |last1=Ganss |first1=Michael |title=HtmlSanitizer |url=https://github.com/mganss/HtmlSanitizer/ |access-date=7 December 2023 |date=5 December 2023}}</ref>

In [[JavaScript]] there are "JS-only" sanitizers for the [[front and back ends|back end]], and browser-based<ref>{{Cite web|url=https://github.com/jitbit/HtmlSanitizer|title=JS HTML Sanitizer|website=[[GitHub]]|date=14 October 2021}}</ref> implementations that use the browser's own [[Document Object Model]] (DOM) parser to parse the HTML (for better performance).