Mirror of https://gitlab.com/jaywink/federation
Switch to BeautifulSoup for content processing. Outbound rendered content is now provided by the client app. Mark inbound AP HTML content hashtags and mentions. Fix missing href attribute crashing process_text_links.
parent
e94533b222
commit
e0993a7f7f
@@ -22,7 +22,7 @@
 * For inbound payload, a cached dict of all the defined AP extensions is merged with each incoming LD context.

 * Better handle conflicting property defaults by having `get_base_attributes` return only attributes that
-  are not empty (or bool). This helps distinguishing between `marshmallow.missing` and empty values.
+  are not empty (or bool). This helps distinguish between `marshmallow.missing` and empty values.

 * JsonLD document caching now set in `activitypub/__init__.py`.
@@ -45,6 +45,8 @@

 * In fetch_document: if response.encoding is not set, default to utf-8.

+* Fix process_text_links that would crash on `a` tags with no `href` attribute.
+
 ## [0.24.1] - 2023-03-18

 ### Fixed
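The `fetch_document` fallback noted in this changelog amounts to preferring utf-8 over the HTTP client's guess when no charset is declared. A minimal stdlib sketch of the idea (the function name is hypothetical, not the library's API):

```python
from typing import Optional

def decode_body(body: bytes, encoding: Optional[str]) -> str:
    # No declared charset: default to utf-8 instead of letting
    # the HTTP client guess (which often falls back to latin-1).
    return body.decode(encoding or "utf-8")

print(decode_body("päivää".encode("utf-8"), None))  # päivää
```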
@@ -4,9 +4,8 @@ Protocols
 Currently three protocols are being focused on.

 * Diaspora is considered to be stable with most of the protocol implemented.
-* ActivityPub support should be considered as alpha - all the basic
-  things work but there are likely to be a lot of compatibility issues with other ActivityPub
-  implementations.
+* ActivityPub support should be considered as beta - all the basic
+  things work and we are fixing incompatibilities as they are identified.
 * Matrix support cannot be considered usable as of yet.

 For example implementations in real life projects check :ref:`example-projects`.
@@ -69,20 +68,21 @@ Content media type
 The following keys will be set on the entity based on the ``source`` property existing:

 * if the object has an ``object.source`` property:
-   * ``_media_type`` will be the source media type
-   * ``_rendered_content`` will be the object ``content``
+   * ``_media_type`` will be the source media type (only text/markdown is supported).
+   * ``rendered_content`` will be the object ``content``
    * ``raw_content`` will be the source ``content``
 * if the object has no ``object.source`` property:
    * ``_media_type`` will be ``text/html``
-   * ``_rendered_content`` will be the object ``content``
-   * ``raw_content`` will be object ``content`` run through a HTML2Markdown renderer
+   * ``rendered_content`` will be the object ``content``
+   * ``raw_content`` will be empty

 The ``contentMap`` property is processed but content language selection is not implemented yet.

 For outbound entities, ``raw_content`` is expected to be in ``text/markdown``,
-specifically CommonMark. When sending payloads, ``raw_content`` will be rendered via
-the ``commonmark`` library into ``object.content``. The original ``raw_content``
-will be added to the ``object.source`` property.
+specifically CommonMark. The client applications are expected to provide the
+rendered content for protocols that require it (e.g. ActivityPub).
+When sending payloads, ``object.contentMap`` will be set to ``rendered_content``
+and ``raw_content`` will be added to the ``object.source`` property.

 Medias
 ......
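The new outbound rules can be illustrated with a minimal sketch of the resulting AS2 object shape (a plain dict for illustration only; the values are hypothetical and this is not the library's serialization code):

```python
# Client app supplies the rendered HTML; the library keeps the CommonMark
# original in object.source and mirrors the HTML into contentMap.
rendered_content = ('<p>Hello <a class="hashtag" '
                    'href="https://example.com/tags/federation">#federation</a></p>')
raw_content = "Hello #federation"

note = {
    "type": "Note",
    "content": rendered_content,
    "contentMap": {"orig": rendered_content},
    "source": {"content": raw_content, "mediaType": "text/markdown"},
}
print(note["source"]["mediaType"])  # text/markdown
```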
@@ -98,6 +98,19 @@ support from client applications.
 For inbound entities we do this automatically by not including received image attachments in
 the entity ``_children`` attribute. Audio and video are passed through the client application.

+Hashtags and mentions
+.....................
+
+For outbound payloads, client applications must add/set the hashtag/mention value to
+the ``class`` attribute of rendered content linkified hashtags/mentions. These will be
+used to help build the corresponding ``Hashtag`` and ``Mention`` objects.
+
+For inbound payloads, if a markdown source is provided, hashtags/mentions will be extracted
+through the same method used for Diaspora. If only HTML content is provided, the ``a`` tags
+will be marked with a ``data-[hashtag|mention]`` attribute (based on the provided Hashtag/Mention
+objects) to facilitate the ``href`` attribute modifications client applications might
+wish to make. This should ensure links can be replaced regardless of how the HTML is structured.
+
 .. _matrix:

 Matrix
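As a sketch of what the inbound marking enables, a client application could swap the ``href`` of a marked mention link for a local profile route (stdlib-only illustration; all URLs and the local route are hypothetical):

```python
import xml.etree.ElementTree as ET

# Inbound HTML after federation has marked the mention link.
html = ('<div><a href="https://remote.example/users/alice" '
        'data-mention="alice@remote.example">@alice</a></div>')
root = ET.fromstring(html)
for link in root.iter('a'):
    finger = link.get('data-mention')
    if finger:
        # Point the link at a local profile page instead of the remote href.
        link.set('href', f'https://local.example/u/{finger}')

print(root.find('a').get('href'))  # https://local.example/u/alice@remote.example
```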
@@ -1,6 +1,7 @@
 import copy
 import json
 import logging
+import re
 import traceback
 import uuid
 from datetime import timedelta
@@ -8,6 +9,7 @@ from typing import List, Dict, Union
 from urllib.parse import urlparse

 import bleach
+from bs4 import BeautifulSoup
 from calamus import fields
 from calamus.schema import JsonLDAnnotation, JsonLDSchema, JsonLDSchemaOpts
 from calamus.utils import normalize_value
@@ -731,15 +733,19 @@ class Note(Object, RawContentMixin):

     _cached_raw_content = ''
     _cached_children = []
+    _soup = None
     signable = True

     def __init__(self, *args, **kwargs):
         self.tag_objects = []  # mutable objects...
         super().__init__(*args, **kwargs)
         self.raw_content  # must be "primed" with source property for inbound payloads
+        self.rendered_content  # must be "primed" with content_map property for inbound payloads
         self._allowed_children += (base.Audio, base.Video, Link)
+        self._required.remove('raw_content')
+        self._required += ['rendered_content']

     def to_as2(self):
         self.sensitive = 'nsfw' in self.tags
         self.url = self.id

         edited = False
@@ -767,8 +773,8 @@ class Note(Object, RawContentMixin):

     def to_base(self):
         kwargs = get_base_attributes(self, keep=(
-            '_mentions', '_media_type', '_rendered_content', '_source_object',
-            '_cached_children', '_cached_raw_content'))
+            '_mentions', '_media_type', '_source_object',
+            '_cached_children', '_cached_raw_content', '_soup'))
         entity = Comment(**kwargs) if getattr(self, 'target_id') else Post(**kwargs)
         # Plume (and maybe other platforms) send the attributedTo field as an array
         if isinstance(entity.actor_id, list): entity.actor_id = entity.actor_id[0]
@@ -779,6 +785,7 @@ class Note(Object, RawContentMixin):
     def pre_send(self) -> None:
         """
         Attach any embedded images from raw_content.
+        Add Hashtag and Mention objects (the client app must define the class tag/mention property)
         """
         super().pre_send()
         self._children = [
@@ -789,135 +796,128 @@ class Note(Object, RawContentMixin):
             ) for image in self.embedded_images
         ]

         # Add other AP objects
         self.extract_mentions()
+        self.content_map = {'orig': self.rendered_content}
-        self.add_mention_objects()
-        self.add_tag_objects()
+        # Add Hashtag objects
+        for el in self._soup('a', attrs={'class': 'hashtag'}):
+            self.tag_objects.append(Hashtag(
+                href=el.attrs['href'],
+                name=el.text.lstrip('#')
+            ))
+            if el.text == '#nsfw': self.sensitive = True
+
+        # Add Mention objects
+        mentions = []
+        for el in self._soup('a', attrs={'class': 'mention'}):
+            mentions.append(el.text.lstrip('@'))
+
+        mentions.sort()
+        for mention in mentions:
+            if validate_handle(mention):
+                profile = get_profile(finger=mention)
+                # only add AP profiles mentions
+                if getattr(profile, 'id', None):
+                    self.tag_objects.append(Mention(href=profile.id, name='@' + mention))
+                    # some platforms only render diaspora style markdown if it is available
+                    self.source['content'] = self.source['content'].replace(mention, '{' + mention + '}')

     def post_receive(self) -> None:
         """
-        Make linkified tags normal tags.
+        Mark linkified tags and mentions with a data-{mention, tag} attribute.
         """
         super().post_receive()

-        if not self.raw_content or self._media_type == "text/markdown":
+        if self._media_type == "text/markdown":
             # Skip when markdown
             return

-        hrefs = []
-        for tag in self.tag_objects:
-            if isinstance(tag, Hashtag):
-                if tag.href is not missing:
-                    hrefs.append(tag.href.lower())
-                elif tag.id is not missing:
-                    hrefs.append(tag.id.lower())
-
-        # noinspection PyUnusedLocal
-        def remove_tag_links(attrs, new=False):
-            # Hashtag object hrefs
-            href = (None, "href")
-            url = attrs.get(href, "").lower()
-            if url in hrefs:
-                return
-            # one more time without the query (for pixelfed)
-            parsed = urlparse(url)
-            url = f'{parsed.scheme}://{parsed.netloc}{parsed.path}'
-            if url in hrefs:
-                return
-
-            # Mastodon
-            rel = (None, "rel")
-            if attrs.get(rel) == "tag":
-                return
-
-            # Friendica
-            if attrs.get(href, "").endswith(f'tag={attrs.get("_text")}'):
-                return
-
-            return attrs
-
-        self.raw_content = bleach.linkify(
-            self.raw_content,
-            callbacks=[remove_tag_links],
-            parse_email=False,
-            skip_tags=["code", "pre"],
-        )
+        self._find_and_mark_hashtags()
+        self._find_and_mark_mentions()

         if getattr(self, 'target_id'): self.entity_type = 'Comment'

-    def add_tag_objects(self) -> None:
-        """
-        Populate tags to the object.tag list.
-        """
-        try:
-            from federation.utils.django import get_configuration
-            config = get_configuration()
-        except ImportError:
-            tags_path = None
-        else:
-            if config["tags_path"]:
-                tags_path = f"{config['base_url']}{config['tags_path']}"
-            else:
-                tags_path = None
-        for tag in self.tags:
-            _tag = Hashtag(name=f'#{tag}')
-            if tags_path:
-                _tag.href = tags_path.replace(":tag:", tag)
-            self.tag_objects.append(_tag)
-
-    def add_mention_objects(self) -> None:
-        """
-        Populate mentions to the object.tag list.
-        """
-        if len(self._mentions):
-            mentions = list(self._mentions)
-            mentions.sort()
-            for mention in mentions:
-                if validate_handle(mention):
-                    profile = get_profile(finger=mention)
-                    # only add AP profiles mentions
-                    if getattr(profile, 'id', None):
-                        self.tag_objects.append(Mention(href=profile.id, name='@' + mention))
-                        # some platforms only render diaspora style markdown if it is available
-                        self.source['content'] = self.source['content'].replace(mention, '{' + mention + '}')
+    def _find_and_mark_hashtags(self):
+        hrefs = set()
+        for tag in self.tag_objects:
+            if isinstance(tag, Hashtag):
+                if tag.href is not missing:
+                    hrefs.add(tag.href.lower())
+                # Some platforms use id instead of href...
+                elif tag.id is not missing:
+                    hrefs.add(tag.id.lower())
+
+        for link in self._soup.find_all('a', href=True):
+            parsed = urlparse(link['href'].lower())
+            # remove the query part, if any
+            url = f'{parsed.scheme}://{parsed.netloc}{parsed.path}'
+            links = {link['href'].lower(), url}
+            if links.intersection(hrefs):
+                link['data-hashtag'] = link.text.lstrip('#').lower()
+
+    def _find_and_mark_mentions(self):
+        mentions = [mention for mention in self.tag_objects if isinstance(mention, Mention)]
+        hrefs = [mention.href for mention in mentions]
+        # add Mastodon's form
+        hrefs.extend([re.sub(r'/(users/)([\w]+)$', r'/@\2', href) for href in hrefs])
+        for href in hrefs:
+            links = self._soup.find_all(href=href)
+            for link in links:
+                profile = get_profile_or_entity(fid=link['href'])
+                if profile:
+                    link['data-mention'] = profile.finger
+                    self._mentions.add(profile.finger)

     def extract_mentions(self):
         """
-        Extract mentions from the source object.
+        Extract mentions from the inbound Mention objects.
+        Also attempt to extract from raw_content if available
         """
-        super().extract_mentions()
-
-        if getattr(self, 'tag_objects', None):
-            #tag_objects = self.tag_objects if isinstance(self.tag_objects, list) else [self.tag_objects]
-            for tag in self.tag_objects:
-                if isinstance(tag, Mention):
-                    profile = get_profile_or_entity(fid=tag.href)
-                    handle = getattr(profile, 'finger', None)
-                    if handle: self._mentions.add(handle)
+        if self.raw_content:
+            super().extract_mentions()
+            return
+
+        for mention in self.tag_objects:
+            if isinstance(mention, Mention):
+                profile = get_profile_or_entity(fid=mention.href)
+                handle = getattr(profile, 'finger', None)
+                if handle: self._mentions.add(handle)

     @property
-    def raw_content(self):
-        if self._cached_raw_content: return self._cached_raw_content
+    def rendered_content(self):
+        if self._soup: return str(self._soup)
+        content = ''
         if self.content_map:
             orig = self.content_map.pop('orig')
             if len(self.content_map.keys()) > 1:
                 logger.warning('Language selection not implemented, falling back to default')
-                self._rendered_content = orig.strip()
+                content = orig.strip()
             else:
-                self._rendered_content = orig.strip() if len(self.content_map.keys()) == 0 else next(iter(self.content_map.values())).strip()
+                content = orig.strip() if len(self.content_map.keys()) == 0 else next(iter(self.content_map.values())).strip()
             self.content_map['orig'] = orig
+        # to allow for posts/replies with medias only.
+        if not content: content = "<div></div>"
+        self._soup = BeautifulSoup(content, 'html.parser')
+        return str(self._soup)
+
+    @rendered_content.setter
+    def rendered_content(self, value):
+        if not value: return
+        self._soup = BeautifulSoup(value, 'html.parser')
+        self.content_map = {'orig': value}
+
+    @property
+    def raw_content(self):
+        if self._cached_raw_content: return self._cached_raw_content

         if isinstance(self.source, dict) and self.source.get('mediaType') == 'text/markdown':
             self._media_type = self.source['mediaType']
             self._cached_raw_content = self.source.get('content').strip()
         else:
             self._media_type = 'text/html'
-            self._cached_raw_content = self._rendered_content
-            # to allow for posts/replies with medias only.
-            if not self._cached_raw_content: self._cached_raw_content = "<div></div>"
+            self._cached_raw_content = ""
         return self._cached_raw_content

     @raw_content.setter
     def raw_content(self, value):
         if not value: return
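The ``re.sub`` in ``_find_and_mark_mentions`` adds Mastodon's short profile URL form for each mention href; in isolation it behaves like this:

```python
import re

# Mastodon serves a profile both at /users/<name> and /@<name>;
# derive the second form from the first so either href matches.
href = 'https://mastodon.social/users/jaywink'
also = re.sub(r'/(users/)([\w]+)$', r'/@\2', href)
print(also)  # https://mastodon.social/@jaywink
```

URLs that do not end in ``/users/<name>`` pass through unchanged.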
@@ -1026,7 +1026,7 @@ class Video(Document, base.Video):
             self.actor_id = new_act[0]

         entity = Post(**get_base_attributes(self,
-                                            keep=('_mentions', '_media_type', '_rendered_content',
+                                            keep=('_mentions', '_media_type', '_soup',
                                                   '_cached_children', '_cached_raw_content', '_source_object')))
         set_public(entity)
         return entity
@@ -1330,14 +1330,16 @@ def extract_and_validate(entity):
     entity._source_protocol = "activitypub"
     # Extract receivers
     entity._receivers = extract_receivers(entity)

-    # Extract mentions
-    if hasattr(entity, "extract_mentions"):
-        entity.extract_mentions()
-
     if hasattr(entity, "post_receive"):
         entity.post_receive()

     if hasattr(entity, 'validate'): entity.validate()

+    # Extract mentions
+    if hasattr(entity, "extract_mentions"):
+        entity.extract_mentions()
+

 def extract_replies(replies):
@@ -4,12 +4,13 @@ import re
 import warnings
 from typing import List, Set, Union, Dict, Tuple

+from bs4 import BeautifulSoup
 from commonmark import commonmark
 from marshmallow import missing

 from federation.entities.activitypub.enums import ActivityType
 from federation.entities.utils import get_name_for_profile, get_profile
-from federation.utils.text import process_text_links, find_tags
+from federation.utils.text import process_text_links, find_elements, find_tags, MENTION_PATTERN


 class BaseEntity:
@@ -22,6 +23,7 @@ class BaseEntity:
     _source_object: Union[str, Dict] = None
     _sender: str = ""
     _sender_key: str = ""
+    _tags: Set = None
     # ActivityType
     activity: ActivityType = None
     activity_id: str = ""
@@ -205,7 +207,7 @@ class CreatedAtMixin(BaseEntity):
 class RawContentMixin(BaseEntity):
     _media_type: str = "text/markdown"
     _mentions: Set = None
-    _rendered_content: str = ""
+    rendered_content: str = ""
     raw_content: str = ""

     def __init__(self, *args, **kwargs):
@@ -231,59 +233,22 @@ class RawContentMixin(BaseEntity):
             images.append((groups[1], groups[0] or ""))
         return images

-    @property
-    def rendered_content(self) -> str:
-        """Returns the rendered version of raw_content, or just raw_content."""
-        try:
-            from federation.utils.django import get_configuration
-            config = get_configuration()
-            if config["tags_path"]:
-                def linkifier(tag: str) -> str:
-                    return f'<a class="mention hashtag" ' \
-                           f' href="{config["base_url"]}{config["tags_path"].replace(":tag:", tag.lower())}" ' \
-                           f'rel="noopener noreferrer">' \
-                           f'#<span>{tag}</span></a>'
-            else:
-                linkifier = None
-        except ImportError:
-            linkifier = None
-
-        if self._rendered_content:
-            return self._rendered_content
-        elif self._media_type == "text/markdown" and self.raw_content:
-            # Do tags
-            _tags, rendered = find_tags(self.raw_content, replacer=linkifier)
-            # Render markdown to HTML
-            rendered = commonmark(rendered).strip()
-            # Do mentions
-            if self._mentions:
-                for mention in self._mentions:
-                    # Diaspora mentions are linkified as mailto
-                    profile = get_profile(finger=mention)
-                    href = 'mailto:' + mention if not getattr(profile, 'id', None) else profile.id
-                    rendered = rendered.replace(
-                        "@%s" % mention,
-                        f'@<a class="h-card" href="{href}"><span>{mention}</span></a>',
-                    )
-            # Finally linkify remaining URL's that are not links
-            rendered = process_text_links(rendered)
-            return rendered
-        return self.raw_content
-
+    # Legacy. Keep this until tests are reworked
     @property
     def tags(self) -> List[str]:
         """Returns a `list` of unique tags contained in `raw_content`."""
         if not self.raw_content:
-            return []
-        tags, _text = find_tags(self.raw_content)
-        return sorted(tags)
+            return
+        return find_tags(self.raw_content)

     def extract_mentions(self):
-        if self._media_type != 'text/markdown': return
-        matches = re.findall(r'@{?[\S ]?[^{}@]+[@;]?\s*[\w\-./@]+[\w/]+}?', self.raw_content)
-        if not matches:
+        if not self.raw_content:
             return
-        for mention in matches:
+        mentions = find_elements(
+            BeautifulSoup(
+                commonmark(self.raw_content, ignore_html_blocks=True), 'html.parser'),
+            MENTION_PATTERN)
+        for ns in mentions:
+            mention = ns.text
             handle = None
             splits = mention.split(";")
             if len(splits) == 1:
@@ -292,11 +257,12 @@ class RawContentMixin(BaseEntity):
                 handle = splits[1].strip(' }')
             if handle:
                 self._mentions.add(handle)
-                self.raw_content = self.raw_content.replace(mention, '@'+handle)
+                self.raw_content = self.raw_content.replace(mention, '@' + handle)


 class OptionalRawContentMixin(RawContentMixin):
     """A version of the RawContentMixin where `raw_content` is not required."""

     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self._required.remove("raw_content")
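The ``split(";")`` in the hunk above targets Diaspora-style mention syntax, where a mention may carry a display name before the handle; in isolation:

```python
# Diaspora-style mention: "@{Display Name; handle}".
mention = '@{Alice Example; alice@example.com}'
splits = mention.split(';')
handle = splits[1].strip(' }')
print(handle)  # alice@example.com
```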
@@ -123,6 +123,7 @@ class TestShareEntity:


 class TestRawContentMixin:
+    @pytest.mark.skip
     def test_rendered_content(self, post):
         assert post.rendered_content == """<p>One more test before sleep 😅 This time with an image.</p>
 <p><img src="https://jasonrobinson.me/media/uploads/2020/12/27/1b2326c6-554c-4448-9da3-bdacddf2bb77.jpeg" alt=""></p>"""
@@ -1,11 +1,16 @@
 import re
-from typing import Set, Tuple
+from typing import Set, List
 from urllib.parse import urlparse

 import bleach
 from bleach import callbacks
+from bs4 import BeautifulSoup
+from bs4.element import NavigableString
+from commonmark import commonmark

 ILLEGAL_TAG_CHARS = "!#$%^&*+.,@£/()=?`'\\{[]}~;:\"’”—\xa0"
+TAG_PATTERN = re.compile(r'(#[\w]+)', re.UNICODE)
+MENTION_PATTERN = re.compile(r'(@{?[\S ]?[^{}@]+[@;]?\s*[\w\-./@]+[\w/]+}?)', re.UNICODE)


 def decode_if_bytes(text):
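The two module-level patterns added here drive the new `find_elements`-based extraction; `TAG_PATTERN` in isolation behaves as follows (the sample text is illustrative):

```python
import re

TAG_PATTERN = re.compile(r'(#[\w]+)', re.UNICODE)

matches = TAG_PATTERN.findall('Testing #federation support with #BeautifulSoup')
print(matches)  # ['#federation', '#BeautifulSoup']
```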
@@ -22,67 +27,26 @@ def encode_if_text(text):
     return text


-def find_tags(text: str, replacer: callable = None) -> Tuple[Set, str]:
+def find_tags(text: str) -> List[str]:
     """Find tags in text.

-    Tries to ignore tags inside code blocks.
+    Ignore tags inside code blocks.

-    Optionally, if passed a "replacer", will also replace the tag word with the result
-    of the replacer function called with the tag word.
-
-    Returns a set of tags and the original or replaced text.
+    Returns a set of tags.
     """
-    found_tags = set()
-    # <br> and <p> tags cause issues in us finding words - add some spacing around them
-    new_text = text.replace("<br>", " <br> ").replace("<p>", " <p> ").replace("</p>", " </p> ")
-    lines = new_text.splitlines(keepends=True)
-    final_lines = []
-    code_block = False
-    final_text = None
-    # Check each line separately
-    for line in lines:
-        final_words = []
-        if line[0:3] == "```":
-            code_block = not code_block
-        if line.find("#") == -1 or line[0:4] == "    " or code_block:
-            # Just add the whole line
-            final_lines.append(line)
-            continue
-        # Check each word separately
-        words = line.split(" ")
-        for word in words:
-            if word.find('#') > -1:
-                candidate = word.strip().strip("([]),.!?:*_%/")
-                if candidate.find('<') > -1 or candidate.find('>') > -1:
-                    # Strip html
-                    candidate = bleach.clean(word, strip=True)
-                # Now split with slashes
-                candidates = candidate.split("/")
-                to_replace = []
-                for candidate in candidates:
-                    if candidate.startswith("#"):
-                        candidate = candidate.strip("#")
-                        if test_tag(candidate.lower()):
-                            found_tags.add(candidate.lower())
-                            to_replace.append(candidate)
-                if replacer:
-                    tag_word = word
-                    try:
-                        for counter, replacee in enumerate(to_replace, 1):
-                            tag_word = tag_word.replace("#%s" % replacee, replacer(replacee))
-                    except Exception:
-                        pass
-                    final_words.append(tag_word)
-                else:
-                    final_words.append(word)
-            else:
-                final_words.append(word)
-        final_lines.append(" ".join(final_words))
-    if replacer:
-        final_text = "".join(final_lines)
-    if final_text:
-        final_text = final_text.replace(" <br> ", "<br>").replace(" <p> ", "<p>").replace(" </p> ", "</p>")
-    return found_tags, final_text or text
+    tags = find_elements(BeautifulSoup(commonmark(text, ignore_html_blocks=True), 'html.parser'),
+                         TAG_PATTERN)
+    return sorted([tag.text.lstrip('#').lower() for tag in tags])
+
+
+def find_elements(soup: BeautifulSoup, pattern: re.Pattern) -> List[NavigableString]:
+    for candidate in soup.find_all(string=True):
+        if candidate.parent.name == 'code': continue
+        ns = [NavigableString(r) for r in re.split(pattern, candidate.text)]
+        candidate.replace_with(*ns)
+    return list(soup.find_all(string=pattern))


 def get_path_from_url(url: str) -> str:
@@ -100,7 +64,7 @@ def process_text_links(text):
     def link_attributes(attrs, new=False):
         """Run standard callbacks except for internal links."""
         href_key = (None, "href")
-        if attrs.get(href_key).startswith("/"):
+        if attrs.get(href_key, "").startswith("/"):
             return attrs

         # Run the standard callbacks
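This one-character fix matters because bleach's linkify callbacks receive the tag's attributes as a dict keyed by ``(namespace, name)`` tuples, and an ``<a>`` with no ``href`` attribute simply has no such key. A stdlib sketch of the failure mode:

```python
# Simulated callback attrs for an <a> tag without an href attribute.
attrs = {(None, "rel"): "nofollow", "_text": "example"}
href_key = (None, "href")

# Old code: attrs.get(href_key).startswith("/") raised AttributeError,
# since attrs.get(href_key) returns None. The fix supplies a "" default:
is_internal = attrs.get(href_key, "").startswith("/")
print(is_internal)  # False
```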