jerakeen.org

by Tom Insam

rgrove’s sanitize

created 21 August 2009 in links tagged html, ruby and sanitizer.

Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.

http://github.com/rgrove/sanitize/tree/master

Unobtrusive Table Sort Script

created 10 February 2009 in links tagged html, javascript, sort and table.

Another JavaScript HTML table sorter.

http://www.frequency-decoder.com/2006/09/16/unobtrusive-t...

Tablesorter 2.0

created 10 February 2009 in links tagged html, jquery, sorting and tablet.

jQuery plugin to make tables sortable by clicking the headers.

http://tablesorter.com/docs/

A Warning About the Real Cost of Microformats

created 06 February 2009 in links tagged api, compatibility, development, html and microformats.

Would you create a real developer API without a TOS, agreement, or at the very least, guidelines? Are you prepared to deal with objections if, when cutting costs, you rev a frontend design and lose some important aspect of microformat structure on the page (or, god forbid, you just don’t bring microformats over at all). Alternatively, are you prepared to announce all frontend markup changes? Does publishing a microformat without a special agreement mean that you are implicitly allowing comprehensive scraping of your web data?

http://getluky.net/2009/01/08/a-warning-about-the-real-co...

Web Development Bookmarklets

created 19 November 2008 in links tagged bookmarklet, development, html and web.

Useful-looking bookmarklets, for when I’m not using Firefox (and therefore Firebug) or Safari. Which is a lot recently, because they’re both annoying me.

https://www.squarefree.com/bookmarklets/webdevel.html

A Guide to CSS Support in Email

created 15 August 2008 in links tagged css, email and html.

Sending HTML email is hard. A guide to what things do and don’t work.

http://www.campaignmonitor.com/css/

CSS System Color Keywords

created 14 August 2008 in links tagged colour, css, html and system.

CSS colour names for system widget colours.

http://webdesign.about.com/od/colorcharts/l/blsystemcolor...

Sanitising comments with Python

created 18 May 2008 in blog tagged comments, html and python.

As is my wont, I’m in the middle of porting jerakeen.org to another back-end. This time, I’m porting it back to the Django-based Python version (it’s been written in Rails for a few months now). It’s grown a few more features, and one of them is somewhat smarter comment parsing.

This being a vaguely technical blog, I have vaguely technical people leaving comments. And most of them want to be able to use HTML. I’ve seen blogs that allow Markdown in comments, but I hate that - unless you know you’re writing it, it’s too easy for Markdown to do things like eat random underscores and italicise the rest of the sentence by accident. But at the same time, I need to let people who just want to type text leave comments.

The trick then is to turn plain text into HTML, but also allow some HTML through. Because the world is a nasty place, this means whitelisting based on tags and attributes, rather than removing known-to-be-nasty things. Glossing over the ‘turn plain text into HTML’ part, because it’s easy, here’s how I use BeautifulSoup to sanitise HTML comments, permitting only a subset of allowed tags and attributes:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, under Python 2

# Assume some evil HTML is in 'evil_html'

# allow these tags. Other tags are turned into bare <span> tags, so their child elements remain
whitelist = ['blockquote', 'em', 'i', 'img', 'strong', 'u', 'a', 'b', "p", "br", "code", "pre" ]

# allow only these attributes on these tags. No other tags are allowed any attributes.
attr_whitelist = { 'a':['href','title','hreflang'], 'img':['src', 'width', 'height', 'alt', 'title'] }

# remove these tags, complete with contents.
blacklist = [ 'script', 'style' ]

attributes_with_urls = [ 'href', 'src' ]

# BeautifulSoup fixes up out-of-order and unclosed tags, so markup
# can't leak out of comments and break the rest of the page.
soup = BeautifulSoup(evil_html)

# now strip HTML we don't like.
for tag in soup.findAll():
    if tag.name.lower() in blacklist:
        # blacklisted tags are removed in their entirety
        tag.extract()
    elif tag.name.lower() in whitelist:
        # tag is allowed. Make sure all the attributes are allowed.
        # iterate over a copy, because attributes are removed inside the loop
        for attr in list(tag.attrs):
            # allowed attributes are whitelisted per-tag
            if tag.name.lower() in attr_whitelist and attr[0].lower() in attr_whitelist[ tag.name.lower() ]:
                # some attributes contain urls..
                if attr[0].lower() in attributes_with_urls:
                    # ..make sure they're nice urls
                    if not re.match(r'(https?|ftp)://', attr[1].lower()):
                        tag.attrs.remove( attr )

                # ok, then
                pass
            else:
                # not a whitelisted attribute. Remove it.
                tag.attrs.remove( attr )
    else:
        # not a whitelisted tag. I'd like to remove it from the tree
        # and replace it with its children. But that's hard. It's much
        # easier to just replace it with an empty span tag.
        tag.name = "span"
        tag.attrs = []

# stringify back again
safe_html = unicode(soup)

# HTML comments can contain executable scripts, depending on the browser, so we'll
# be paranoid and just get rid of all of them
# e.g. <!--[if lt IE 7]><script type="text/javascript">h4x0r();</script><![endif]-->
# TODO - I rather suspect that this is the weakest part of the operation..
# (?s) lets '.' match newlines too; inside [...] a '.' only matches a literal dot
safe_html = re.sub(r'(?s)<!--.*?-->', '', safe_html)

It’s based on an Hpricot HTML sanitizer that I’ve used in a few things.
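The ‘turn plain text into HTML’ step that the post glosses over could look something like the sketch below. It is not the code this site uses: `text_to_html` is a hypothetical helper, it is written for Python 3 with only the standard library, and it assumes the step runs before the whitelist pass, so any markup the commenter typed is deliberately left untouched for the sanitiser to deal with.

```python
import re

def text_to_html(text):
    """Wrap blank-line-separated blocks in <p> tags (hypothetical helper).

    Markup typed by the commenter is left alone on purpose: the output is
    NOT safe until it has been through the whitelist sanitiser.
    """
    # Normalise line endings so Windows and old Mac input behave the same.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    # One or more blank (or whitespace-only) lines separate paragraphs.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if block:
            # Single newlines inside a paragraph become line breaks.
            paragraphs.append("<p>" + block.replace("\n", "<br />\n") + "</p>")
    return "\n".join(paragraphs)
```

Both `p` and `br` are in the tag whitelist, so the markup this adds would survive sanitising. The sketch is naive about newlines inside `pre` blocks, which would also pick up `<br />` tags.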

Update 2008-05-23: My thanks to Paul Hammond and Mark Fowler, who pointed me at all manner of nasty things (such as javascript: URLs) that I didn’t handle very well. I now also whitelist allowed URIs. I should also point out the test suite I use - all code needs tests!
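As an illustration of the kind of check such tests might make, here is a minimal sketch of the URL whitelist on its own. It is not the actual test suite: `is_allowed_url` and `check_url_whitelist` are hypothetical names, and only the regular expression is taken from the sanitiser code.

```python
import re

# Same pattern as the sanitiser: only absolute http, https and ftp URLs pass.
ALLOWED_URL = re.compile(r'(https?|ftp)://')

def is_allowed_url(value):
    """Return True if value starts with a whitelisted scheme (hypothetical helper)."""
    return ALLOWED_URL.match(value.lower()) is not None

def check_url_whitelist():
    """Assert a handful of accept/reject cases; returns True if all hold."""
    assert is_allowed_url("http://jerakeen.org/")
    assert is_allowed_url("HTTPS://example.com/a.png")
    assert is_allowed_url("ftp://example.com/file")
    # Script and data URLs are rejected because they match no allowed scheme.
    assert not is_allowed_url("javascript:alert(1)")
    assert not is_allowed_url(" javascript:alert(1)")
    assert not is_allowed_url("data:text/html;base64,AAAA")
    # Relative links are rejected too, since the pattern demands a scheme.
    assert not is_allowed_url("/relative/path")
    return True
```

Because the check is a whitelist, a scheme nobody thought of is rejected by default. The cost is that relative links in comments lose their `href` as well.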

Beautiful Soup: We called him Tortoise because he taught us.

created 18 May 2008 in links tagged html, parser and python.

Lovely HTML parser for Python. I’m using it to sanitize comments - sample code soon.

http://www.crummy.com/software/BeautifulSoup/

as days pass by » Blog Archive » DOMContentLoaded for IE, Safari, everything, without document.write

created 26 September 2007 in links tagged ajax, embed, html and javascript.

Useful snippet to fire an event once the DOM is loaded. Cross-browser and doesn’t need a large library.

http://www.kryogenix.org/days/2007/09/26/shortloaded

24 ways: Swooshy Curly Quotes Without Images

created 30 May 2007 in links tagged css, html and quotes.

Pretty blockquotes with CSS.

http://24ways.org/2005/swooshy-curly-quotes-without-images

The Elements of Typographic Style Applied to the Web - a practical guide to web typography

created 09 December 2005 in links tagged html, toread, typography and web.

http://webtypography.net/