File: //lib/python3/dist-packages/html5lib/__pycache__/_tokenizer.cpython-310.pyc
o
V=�^�, � @ s� d dl mZmZmZ d dlmZ d dlmZm Z d dl
mZ ddlm
Z
ddlmZ ddlmZmZ dd lmZmZmZ dd
lmZmZ ddlmZ ddlmZ dd
lmZ ee�ZedkraeZne ZG dd� de�ZdS )� )�absolute_import�division�unicode_literals)�unichr)�deque�OrderedDict)�version_info� )�spaceCharacters)�entities)�asciiLetters�asciiUpper2Lower)�digits� hexDigits�EOF)�
tokenTypes�
tagTokenTypes)�replacementCharacters)�HTMLInputStream)�Trie)� � c sd e Zd ZdZd�� fdd� Zdd� Zdd� Zd�d
d�Zdd
� Zdd� Z dd� Z
dd� Zdd� Zdd� Z
dd� Zdd� Zdd� Zdd� Zd d!� Zd"d#� Zd$d%� Zd&d'� Zd(d)� Zd*d+� Zd,d-� Zd.d/� Zd0d1� Zd2d3� Zd4d5� Zd6d7� Zd8d9� Zd:d;� Zd<d=� Z d>d?� Z!d@dA� Z"dBdC� Z#dDdE� Z$dFdG� Z%dHdI� Z&dJdK� Z'dLdM� Z(dNdO� Z)dPdQ� Z*dRdS� Z+dTdU� Z,dVdW� Z-dXdY� Z.dZd[� Z/d\d]� Z0d^d_� Z1d`da� Z2dbdc� Z3ddde� Z4dfdg� Z5dhdi� Z6djdk� Z7dldm� Z8dndo� Z9dpdq� Z:drds� Z;dtdu� Z<dvdw� Z=dxdy� Z>dzd{� Z?d|d}� Z@d~d� ZAd�d�� ZBd�d�� ZCd�d�� ZDd�d�� ZEd�d�� ZFd�d�� ZGd�d�� ZHd�d�� ZId�d�� ZJd�d�� ZKd�d�� ZL� ZMS )��
HTMLTokenizera This class takes care of tokenizing HTML.
* self.currentToken
Holds the token that is currently being processed.
* self.state
Holds a reference to the method to be invoked... XXX
* self.stream
Points to HTMLInputStream object.
Nc sJ t |fi |��| _|| _d| _g | _| j| _d| _d | _t t
| ��� d S �NF)r �stream�parser�
escapeFlag�
lastFourChars� dataState�state�escape�currentToken�superr �__init__)�selfr r �kwargs�� __class__� �5/usr/lib/python3/dist-packages/html5lib/_tokenizer.pyr# ( s zHTMLTokenizer.__init__c c sf � t g �| _| �� r1| jjrtd | jj�d�d�V | jjs| jr+| j�� V | js"| �� s
dS dS )z� This is where the magic happens.
We do our usual processing through the states and when we have a token
to return we yield the token which pauses processing until the next token
is requested.
�
ParseErrorr ��type�dataN)r �
tokenQueuer r �errorsr �pop�popleft�r$ r( r( r) �__iter__7 s �
���zHTMLTokenizer.__iter__c C s� t }d}|r
t}d}g }| j�� }||v r+|tur+|�|� | j�� }||v r+|tustd�|�|�}|tv rJt| }| j �t
d dd|id�� n�d| krTd ksYn |d
krjd}| j �t
d dd|id�� nfd| krtd
ks�n d| krdks�n d| kr�dks�n d| kr�dks�n |tg d��v r�| j �t
d dd|id�� zt|�}W n t
y� |d }td|d? B �td|d@ B � }Y nw |dkr�| j �t
d dd�� | j�|� |S )z�This function returns either U+FFFD or the character based on the
decimal or hexadecimal representation. It also discards ";" if present.
If not present self.tokenQueue.append({"type": tokenTypes["ParseError"]}) is invoked.
�
� � r* z$illegal-codepoint-for-numeric-entity� charAsInt�r, r- �datavarsi � i�� � � �r � � � � � i� i� )#� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i�� i��
i��
i�� i�� i�� i�� i��
i��
i�� i�� i�� i�� i�� r: i i � i� �;z numeric-entity-without-semicolonr+ )r r r �charr �append�int�joinr r. r � frozenset�chr�
ValueError�unget) r$ �isHex�allowed�radix� charStack�cr7 rC �vr( r( r) �consumeNumberEntityG s\
�
�
� �$��z!HTMLTokenizer.consumeNumberEntityFc C s� d}| j �� g}|d tv s!|d tddfv s!|d ur+||d kr+| j �|d � �n|d dkr�d}|�| j �� � |d dv rKd}|�| j �� � |rS|d tv s[|si|d tv ri| j �|d � | �|�}n�| j �t
d d
d�� | j �|�� � dd�|� }n�|d tur�t
�d�|��s�n|�| j �� � |d tus�zt
�d�|d d� ��}t|�}W n ty� d }Y nw |d u�r|d d
kr�| j �t
d dd�� |d d
kr�|r�|| tv s�|| tv s�|| dkr�| j �|�� � dd�|� }n2t| }| j �|�� � |d�||d � �7 }n| j �t
d dd�� | j �|�� � dd�|� }|�rC| jd d d |7 <