<p>I needed to run XML validation on documents that were potentially HTML 5. HTML 4 and XHTML had only a meager 250 or so entities, while the current draft (January 2012) has more than 2000.</p>
<pre><code>GET 'http://www.w3.org/TR/html5-author/named-character-references.html' |
xmllint --html --xmlout --format --noent - |
egrep '&lt;code|&lt;span.*glyph' | # get only the bits we're interested in
sed -e 's/.*"&gt;/__/' | # Add some "__" markers to make e.g. whitespace
sed -e 's/&lt;.*/__/' | # entities work with xargs
sed 's/"/\&amp;quot;/' | # xmllint output contains " which messes up xargs
sed "s/'/\&amp;apos;/" | # ditto apostrophes. Make them HTML entities instead.
xargs -n 2 echo | # Put the entity names and values on one line
sed 's/__/&lt;!ENTITY /' | # Make a DTD
sed 's/;__/ /' |
sed 's/ __/"/' |
sed 's/__$/"&gt;/' |
egrep -v '\bapos\b|\bquot\b|\blt\b|\bgt\b|\bamp\b' # remove XML entities.
</code></pre>
<p>You end up with a file containing 2114 entities.</p>
<pre><code>&lt;!ENTITY AElig "&amp;#xC6;"&gt;
&lt;!ENTITY Aacute "&amp;#xC1;"&gt;
&lt;!ENTITY Abreve "&amp;#x102;"&gt;
&lt;!ENTITY Acirc "&amp;#xC2;"&gt;
&lt;!ENTITY Acy "&amp;#x410;"&gt;
&lt;!ENTITY Afr "&amp;#x1D504;"&gt;
</code></pre>
<p>Plugging this file into an XML parser should allow the parser to resolve these character entities.</p>
<p><em>Update October 2012</em>: Since the working draft now has a JSON file, I worked it down to a single sed (yes, I'm still using regular expressions):</p>
<pre><code>curl -s 'http://www.w3.org/TR/html5-author/entities.json' |
sed -n '/^ "&amp;/s/"&amp;\([^;"]*\)[^0-9]*\[\([0-9]*\)\].*/&lt;!ENTITY \1 "\&amp;#\2;"&gt;/p' |
uniq
</code></pre>
<p>Of course a JavaScript equivalent would be a lot more robust, but not everyone has Node installed. Everyone has sed, right? Random sample output:</p>
<pre><code>&lt;!ENTITY subsetneqq "&amp;#10955;"&gt;
&lt;!ENTITY subsim "&amp;#10951;"&gt;
&lt;!ENTITY subsub "&amp;#10965;"&gt;
&lt;!ENTITY subsup "&amp;#10963;"&gt;
&lt;!ENTITY succapprox "&amp;#10936;"&gt;
&lt;!ENTITY succ "&amp;#8827;"&gt;
</code></pre>
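<p>One limitation of the single sed is that its pattern expects exactly one number between the square brackets, so an entry with several code points would be silently skipped. The following is a minimal awk sketch of a slightly more tolerant variant, not a replacement for a real JSON parser. It assumes that entities.json keeps one entry per line with a "codepoints" array, as in the sample input shown; the function name <code>entities_to_dtd</code> and the sample lines are illustrative and do not come from the original pipeline. It also skips the five predefined XML entities, mirroring the egrep filter in the first version, and deduplicates names itself instead of relying on uniq.</p>

```shell
#!/bin/sh
# Sketch: turn entities.json-style lines (assumed: one entry per line) into
# DTD entity declarations, including entries with more than one code point.
entities_to_dtd() {
  awk '
    /^[ \t]*"&[A-Za-z0-9]+;?":/ {
      # Extract the entity name, with or without its trailing semicolon.
      name = $0
      sub(/^[ \t]*"&/, "", name)
      sub(/;?".*$/, "", name)

      # Skip the predefined XML entities, as the first pipeline does.
      if (name ~ /^(amp|lt|gt|quot|apos)$/) next

      # Pull out the contents of the first [...] list of decimal numbers.
      if (!match($0, /\[[0-9, ]+\]/)) next
      list = substr($0, RSTART + 1, RLENGTH - 2)

      # Emit one character reference per code point.
      n = split(list, cp, /, */)
      value = ""
      for (i = 1; i <= n; i++) value = value "&#" cp[i] ";"

      # "&AElig" and "&AElig;" map to the same name; print it only once.
      if (!(name in seen)) {
        seen[name] = 1
        printf "<!ENTITY %s \"%s\">\n", name, value
      }
    }
  '
}

# Illustrative input in the assumed format (not fetched from the network).
entities_to_dtd <<'EOF'
  "&AElig": { "codepoints": [198], "characters": "\u00C6" },
  "&AElig;": { "codepoints": [198], "characters": "\u00C6" },
  "&NotEqualTilde;": { "codepoints": [8770, 824], "characters": "\u2242\u0338" },
  "&lt;": { "codepoints": [60], "characters": "\u003C" },
EOF
```

<p>To use it on the real file, the curl command from the update could be piped into <code>entities_to_dtd</code> in place of the sed and uniq stages.</p>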
 
