Switch to BeautifulSoup for content processing. Outbound rendered content is now provided by the client app. Mark inbound AP HTML content hashtags and mentions. Fix missing href attribute crashing process_text_links.

ap-processing-improvements
Alain St-Denis 2023-07-08 07:34:44 -04:00
parent e94533b222
commit e0993a7f7f
6 changed files with 177 additions and 229 deletions

View file

@@ -22,7 +22,7 @@
* For inbound payload, a cached dict of all the defined AP extensions is merged with each incoming LD context.
* Better handle conflicting property defaults by having `get_base_attributes` return only attributes that
are not empty (or bool). This helps distinguishing between `marshmallow.missing` and empty values.
are not empty (or bool). This helps distinguish between `marshmallow.missing` and empty values.
* JsonLD document caching now set in `activitypub/__init__.py`.
@@ -45,6 +45,8 @@
* In fetch_document: if response.encoding is not set, default to utf-8.
* Fix process_text_links that would crash on `a` tags with no `href` attribute.
## [0.24.1] - 2023-03-18
### Fixed

View file

@@ -4,9 +4,8 @@ Protocols
Currently three protocols are being focused on.
* Diaspora is considered to be stable with most of the protocol implemented.
* ActivityPub support should be considered as alpha - all the basic
things work but there are likely to be a lot of compatibility issues with other ActivityPub
implementations.
* ActivityPub support should be considered as beta - all the basic
things work and we are fixing incompatibilities as they are identified.
* Matrix support cannot be considered usable as of yet.
For example implementations in real-life projects, check :ref:`example-projects`.
@@ -69,20 +68,21 @@ Content media type
The following keys will be set on the entity based on the ``source`` property existing:
* if the object has an ``object.source`` property:
* ``_media_type`` will be the source media type
* ``_rendered_content`` will be the object ``content``
* ``_media_type`` will be the source media type (only text/markdown is supported).
* ``rendered_content`` will be the object ``content``
* ``raw_content`` will be the source ``content``
* if the object has no ``object.source`` property:
* ``_media_type`` will be ``text/html``
* ``_rendered_content`` will be the object ``content``
* ``raw_content`` will be the object ``content`` run through an HTML2Markdown renderer
* ``rendered_content`` will be the object ``content``
* ``raw_content`` will be empty
The ``contentMap`` property is processed but content language selection is not implemented yet.
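The inbound key mapping listed above can be sketched as follows; `set_inbound_content` is a hypothetical helper name for illustration, not part of the federation library API:

```python
# Hedged sketch of the inbound mapping listed above. set_inbound_content is
# a hypothetical helper name, not part of the federation library API.
def set_inbound_content(obj: dict) -> dict:
    source = obj.get("source")
    if isinstance(source, dict) and source.get("mediaType") == "text/markdown":
        return {
            "_media_type": "text/markdown",
            "rendered_content": obj.get("content", ""),
            "raw_content": source.get("content", "").strip(),
        }
    # No usable markdown source: keep the HTML content, leave raw_content empty
    return {
        "_media_type": "text/html",
        "rendered_content": obj.get("content", ""),
        "raw_content": "",
    }
```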
For outbound entities, ``raw_content`` is expected to be in ``text/markdown``,
specifically CommonMark. When sending payloads, ``raw_content`` will be rendered via
the ``commonmark`` library into ``object.content``. The original ``raw_content``
will be added to the ``object.source`` property.
specifically CommonMark. The client applications are expected to provide the
rendered content for protocols that require it (e.g. ActivityPub).
When sending payloads, ``object.contentMap`` will be set to ``rendered_content``
and ``raw_content`` will be added to the ``object.source`` property.
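A minimal sketch of the outbound shape described above, assuming the client app has already rendered the markdown; `build_outbound_payload` is an illustrative name only:

```python
# Hedged sketch of the outbound payload shape described above, assuming the
# client app already rendered raw_content. build_outbound_payload is
# illustrative only; the library builds the payload inside the Note entity.
def build_outbound_payload(rendered_content: str, raw_content: str) -> dict:
    return {
        "content": rendered_content,
        # Language selection is not implemented, so a single "orig" key is used
        "contentMap": {"orig": rendered_content},
        "source": {"mediaType": "text/markdown", "content": raw_content},
    }
```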
Medias
......
@@ -98,6 +98,19 @@ support from client applications.
For inbound entities we do this automatically by not including received image attachments in
the entity ``_children`` attribute. Audio and video are passed through to the client application.
Hashtags and mentions
.....................
For outbound payloads, client applications must set the ``hashtag``/``mention`` value on
the ``class`` attribute of linkified hashtags/mentions in the rendered content. These are
used to help build the corresponding ``Hashtag`` and ``Mention`` objects.
For inbound payloads, if a markdown source is provided, hashtags/mentions will be extracted
through the same method used for Diaspora. If only HTML content is provided, the ``a`` tags
will be marked with a ``data-[hashtag|mention]`` attribute (based on the provided Hashtag/Mention
objects) to facilitate the ``href`` attribute modifications client applications might
wish to make. This should ensure links can be replaced regardless of how the HTML is structured.
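The inbound marking can be sketched with BeautifulSoup as follows; `mark_hashtags` is a hypothetical helper, and `hashtag_hrefs` stands for the hrefs collected from the payload's ``Hashtag`` objects:

```python
# Hedged sketch of the inbound marking described above. mark_hashtags is a
# hypothetical helper; hashtag_hrefs stands for the hrefs collected from the
# payload's Hashtag objects.
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def mark_hashtags(html: str, hashtag_hrefs: set) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=True):
        parsed = urlparse(link["href"].lower())
        # Also compare the query-less form: Pixelfed appends a query
        # string to its hashtag links.
        stripped = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
        if {link["href"].lower(), stripped}.intersection(hashtag_hrefs):
            link["data-hashtag"] = link.text.lstrip("#").lower()
    return str(soup)
```

Because the ``a`` elements are located via the ``Hashtag``/``Mention`` objects rather than by markup shape, the marking works however the remote platform structured its HTML.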
.. _matrix:
Matrix

View file

@@ -1,6 +1,7 @@
import copy
import json
import logging
import re
import traceback
import uuid
from datetime import timedelta
@@ -8,6 +9,7 @@ from typing import List, Dict, Union
from urllib.parse import urlparse
import bleach
from bs4 import BeautifulSoup
from calamus import fields
from calamus.schema import JsonLDAnnotation, JsonLDSchema, JsonLDSchemaOpts
from calamus.utils import normalize_value
@@ -731,15 +733,19 @@ class Note(Object, RawContentMixin):
_cached_raw_content = ''
_cached_children = []
_soup = None
signable = True
def __init__(self, *args, **kwargs):
self.tag_objects = [] # mutable objects...
super().__init__(*args, **kwargs)
self.raw_content # must be "primed" with source property for inbound payloads
self.rendered_content # must be "primed" with content_map property for inbound payloads
self._allowed_children += (base.Audio, base.Video, Link)
self._required.remove('raw_content')
self._required += ['rendered_content']
def to_as2(self):
self.sensitive = 'nsfw' in self.tags
self.url = self.id
edited = False
@@ -767,8 +773,8 @@ class Note(Object, RawContentMixin):
def to_base(self):
kwargs = get_base_attributes(self, keep=(
'_mentions', '_media_type', '_rendered_content', '_source_object',
'_cached_children', '_cached_raw_content'))
'_mentions', '_media_type', '_source_object',
'_cached_children', '_cached_raw_content', '_soup'))
entity = Comment(**kwargs) if getattr(self, 'target_id') else Post(**kwargs)
# Plume (and maybe other platforms) send the attributedTo field as an array
if isinstance(entity.actor_id, list): entity.actor_id = entity.actor_id[0]
@@ -779,6 +785,7 @@ class Note(Object, RawContentMixin):
def pre_send(self) -> None:
"""
Attach any embedded images from raw_content.
Add Hashtag and Mention objects (the client app must define the class tag/mention property)
"""
super().pre_send()
self._children = [
@@ -789,135 +796,128 @@ class Note(Object, RawContentMixin):
) for image in self.embedded_images
]
# Add other AP objects
self.extract_mentions()
self.content_map = {'orig': self.rendered_content}
self.add_mention_objects()
self.add_tag_objects()
# Add Hashtag objects
for el in self._soup('a', attrs={'class':'hashtag'}):
self.tag_objects.append(Hashtag(
href = el.attrs['href'],
name = el.text.lstrip('#')
))
if el.text == '#nsfw': self.sensitive = True
# Add Mention objects
mentions = []
for el in self._soup('a', attrs={'class':'mention'}):
mentions.append(el.text.lstrip('@'))
mentions.sort()
for mention in mentions:
if validate_handle(mention):
profile = get_profile(finger=mention)
# only add AP profiles mentions
if getattr(profile, 'id', None):
self.tag_objects.append(Mention(href=profile.id, name='@'+mention))
# some platforms only render diaspora style markdown if it is available
self.source['content'] = self.source['content'].replace(mention, '{' + mention + '}')
def post_receive(self) -> None:
"""
Make linkified tags normal tags.
Mark linkified tags and mentions with a data-{mention, tag} attribute.
"""
super().post_receive()
if not self.raw_content or self._media_type == "text/markdown":
if self._media_type == "text/markdown":
# Skip when markdown
return
hrefs = []
for tag in self.tag_objects:
if isinstance(tag, Hashtag):
if tag.href is not missing:
hrefs.append(tag.href.lower())
elif tag.id is not missing:
hrefs.append(tag.id.lower())
# noinspection PyUnusedLocal
def remove_tag_links(attrs, new=False):
# Hashtag object hrefs
href = (None, "href")
url = attrs.get(href, "").lower()
if url in hrefs:
return
# one more time without the query (for pixelfed)
parsed = urlparse(url)
url = f'{parsed.scheme}://{parsed.netloc}{parsed.path}'
if url in hrefs:
return
# Mastodon
rel = (None, "rel")
if attrs.get(rel) == "tag":
return
# Friendica
if attrs.get(href, "").endswith(f'tag={attrs.get("_text")}'):
return
return attrs
self.raw_content = bleach.linkify(
self.raw_content,
callbacks=[remove_tag_links],
parse_email=False,
skip_tags=["code", "pre"],
)
self._find_and_mark_hashtags()
self._find_and_mark_mentions()
if getattr(self, 'target_id'): self.entity_type = 'Comment'
def add_tag_objects(self) -> None:
"""
Populate tags to the object.tag list.
"""
try:
from federation.utils.django import get_configuration
config = get_configuration()
except ImportError:
tags_path = None
else:
if config["tags_path"]:
tags_path = f"{config['base_url']}{config['tags_path']}"
else:
tags_path = None
for tag in self.tags:
_tag = Hashtag(name=f'#{tag}')
if tags_path:
_tag.href = tags_path.replace(":tag:", tag)
self.tag_objects.append(_tag)
def _find_and_mark_hashtags(self):
hrefs = set()
for tag in self.tag_objects:
if isinstance(tag, Hashtag):
if tag.href is not missing:
hrefs.add(tag.href.lower())
# Some platforms use id instead of href...
elif tag.id is not missing:
hrefs.add(tag.id.lower())
def add_mention_objects(self) -> None:
"""
Populate mentions to the object.tag list.
"""
if len(self._mentions):
mentions = list(self._mentions)
mentions.sort()
for mention in mentions:
if validate_handle(mention):
profile = get_profile(finger=mention)
# only add AP profiles mentions
if getattr(profile, 'id', None):
self.tag_objects.append(Mention(href=profile.id, name='@'+mention))
# some platforms only render diaspora style markdown if it is available
self.source['content'] = self.source['content'].replace(mention, '{'+mention+'}')
for link in self._soup.find_all('a', href=True):
parsed = urlparse(link['href'].lower())
# remove the query part, if any
url = f'{parsed.scheme}://{parsed.netloc}{parsed.path}'
links = {link['href'].lower(), url}
if links.intersection(hrefs):
link['data-hashtag'] = link.text.lstrip('#').lower()
def _find_and_mark_mentions(self):
mentions = [mention for mention in self.tag_objects if isinstance(mention, Mention)]
hrefs = [mention.href for mention in mentions]
# add Mastodon's form
hrefs.extend([re.sub(r'/(users/)([\w]+)$', r'/@\2', href) for href in hrefs])
for href in hrefs:
links = self._soup.find_all(href=href)
for link in links:
profile = get_profile_or_entity(fid=link['href'])
if profile:
link['data-mention'] = profile.finger
self._mentions.add(profile.finger)
def extract_mentions(self):
"""
Extract mentions from the source object.
"""
super().extract_mentions()
Extract mentions from the inbound Mention objects.
if getattr(self, 'tag_objects', None):
#tag_objects = self.tag_objects if isinstance(self.tag_objects, list) else [self.tag_objects]
for tag in self.tag_objects:
if isinstance(tag, Mention):
profile = get_profile_or_entity(fid=tag.href)
handle = getattr(profile, 'finger', None)
if handle: self._mentions.add(handle)
Also attempt to extract from raw_content if available
"""
if self.raw_content:
super().extract_mentions()
return
for mention in self.tag_objects:
if isinstance(mention, Mention):
profile = get_profile_or_entity(fid=mention.href)
handle = getattr(profile, 'finger', None)
if handle: self._mentions.add(handle)
@property
def raw_content(self):
if self._cached_raw_content: return self._cached_raw_content
def rendered_content(self):
if self._soup: return str(self._soup)
content = ''
if self.content_map:
orig = self.content_map.pop('orig')
if len(self.content_map.keys()) > 1:
logger.warning('Language selection not implemented, falling back to default')
self._rendered_content = orig.strip()
content = orig.strip()
else:
self._rendered_content = orig.strip() if len(self.content_map.keys()) == 0 else next(iter(self.content_map.values())).strip()
content = orig.strip() if len(self.content_map.keys()) == 0 else next(iter(self.content_map.values())).strip()
self.content_map['orig'] = orig
# to allow for posts/replies with medias only.
if not content: content = "<div></div>"
self._soup = BeautifulSoup(content, 'html.parser')
return str(self._soup)
@rendered_content.setter
def rendered_content(self, value):
if not value: return
self._soup = BeautifulSoup(value, 'html.parser')
self.content_map = {'orig': value}
@property
def raw_content(self):
if self._cached_raw_content: return self._cached_raw_content
if isinstance(self.source, dict) and self.source.get('mediaType') == 'text/markdown':
self._media_type = self.source['mediaType']
self._cached_raw_content = self.source.get('content').strip()
else:
self._media_type = 'text/html'
self._cached_raw_content = ""
return self._cached_raw_content
if isinstance(self.source, dict) and self.source.get('mediaType') == 'text/markdown':
self._media_type = self.source['mediaType']
self._cached_raw_content = self.source.get('content').strip()
else:
self._media_type = 'text/html'
self._cached_raw_content = self._rendered_content
# to allow for posts/replies with medias only.
if not self._cached_raw_content: self._cached_raw_content = "<div></div>"
return self._cached_raw_content
@raw_content.setter
def raw_content(self, value):
if not value: return
@@ -1026,7 +1026,7 @@ class Video(Document, base.Video):
self.actor_id = new_act[0]
entity = Post(**get_base_attributes(self,
keep=('_mentions', '_media_type', '_rendered_content',
keep=('_mentions', '_media_type', '_soup',
'_cached_children', '_cached_raw_content', '_source_object')))
set_public(entity)
return entity
@@ -1330,14 +1330,16 @@ def extract_and_validate(entity):
entity._source_protocol = "activitypub"
# Extract receivers
entity._receivers = extract_receivers(entity)
# Extract mentions
if hasattr(entity, "extract_mentions"):
entity.extract_mentions()
if hasattr(entity, "post_receive"):
entity.post_receive()
if hasattr(entity, 'validate'): entity.validate()
# Extract mentions
if hasattr(entity, "extract_mentions"):
entity.extract_mentions()
def extract_replies(replies):

View file

@@ -4,12 +4,13 @@ import re
import warnings
from typing import List, Set, Union, Dict, Tuple
from bs4 import BeautifulSoup
from commonmark import commonmark
from marshmallow import missing
from federation.entities.activitypub.enums import ActivityType
from federation.entities.utils import get_name_for_profile, get_profile
from federation.utils.text import process_text_links, find_tags
from federation.utils.text import process_text_links, find_elements, find_tags, MENTION_PATTERN
class BaseEntity:
@@ -22,6 +23,7 @@ class BaseEntity:
_source_object: Union[str, Dict] = None
_sender: str = ""
_sender_key: str = ""
_tags: Set = None
# ActivityType
activity: ActivityType = None
activity_id: str = ""
@@ -205,7 +207,7 @@ class CreatedAtMixin(BaseEntity):
class RawContentMixin(BaseEntity):
_media_type: str = "text/markdown"
_mentions: Set = None
_rendered_content: str = ""
rendered_content: str = ""
raw_content: str = ""
def __init__(self, *args, **kwargs):
@@ -231,59 +233,22 @@ class RawContentMixin(BaseEntity):
images.append((groups[1], groups[0] or ""))
return images
@property
def rendered_content(self) -> str:
"""Returns the rendered version of raw_content, or just raw_content."""
try:
from federation.utils.django import get_configuration
config = get_configuration()
if config["tags_path"]:
def linkifier(tag: str) -> str:
return f'<a class="mention hashtag" ' \
f' href="{config["base_url"]}{config["tags_path"].replace(":tag:", tag.lower())}" ' \
f'rel="noopener noreferrer">' \
f'#<span>{tag}</span></a>'
else:
linkifier = None
except ImportError:
linkifier = None
if self._rendered_content:
return self._rendered_content
elif self._media_type == "text/markdown" and self.raw_content:
# Do tags
_tags, rendered = find_tags(self.raw_content, replacer=linkifier)
# Render markdown to HTML
rendered = commonmark(rendered).strip()
# Do mentions
if self._mentions:
for mention in self._mentions:
# Diaspora mentions are linkified as mailto
profile = get_profile(finger=mention)
href = 'mailto:'+mention if not getattr(profile, 'id', None) else profile.id
rendered = rendered.replace(
"@%s" % mention,
f'@<a class="h-card" href="{href}"><span>{mention}</span></a>',
)
# Finally linkify remaining URL's that are not links
rendered = process_text_links(rendered)
return rendered
return self.raw_content
# Legacy. Keep this until tests are reworked
@property
def tags(self) -> List[str]:
"""Returns a `list` of unique tags contained in `raw_content`."""
if not self.raw_content:
return []
tags, _text = find_tags(self.raw_content)
return sorted(tags)
return
return find_tags(self.raw_content)
def extract_mentions(self):
if self._media_type != 'text/markdown': return
matches = re.findall(r'@{?[\S ]?[^{}@]+[@;]?\s*[\w\-./@]+[\w/]+}?', self.raw_content)
if not matches:
if not self.raw_content:
return
for mention in matches:
mentions = find_elements(
BeautifulSoup(
commonmark(self.raw_content, ignore_html_blocks=True), 'html.parser'),
MENTION_PATTERN)
for ns in mentions:
mention = ns.text
handle = None
splits = mention.split(";")
if len(splits) == 1:
@@ -292,11 +257,12 @@ class RawContentMixin(BaseEntity):
handle = splits[1].strip(' }')
if handle:
self._mentions.add(handle)
self.raw_content = self.raw_content.replace(mention, '@'+handle)
self.raw_content = self.raw_content.replace(mention, '@' + handle)
class OptionalRawContentMixin(RawContentMixin):
"""A version of the RawContentMixin where `raw_content` is not required."""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._required.remove("raw_content")

View file

@@ -123,6 +123,7 @@ class TestShareEntity:
class TestRawContentMixin:
@pytest.mark.skip
def test_rendered_content(self, post):
assert post.rendered_content == """<p>One more test before sleep 😅 This time with an image.</p>
<p><img src="https://jasonrobinson.me/media/uploads/2020/12/27/1b2326c6-554c-4448-9da3-bdacddf2bb77.jpeg" alt=""></p>"""

View file

@@ -1,11 +1,16 @@
import re
from typing import Set, Tuple
from typing import Set, List
from urllib.parse import urlparse
import bleach
from bleach import callbacks
from bs4 import BeautifulSoup
from bs4.element import NavigableString
from commonmark import commonmark
ILLEGAL_TAG_CHARS = "!#$%^&*+.,@£/()=?`'\\{[]}~;:\"’”—\xa0"
TAG_PATTERN = re.compile(r'(#[\w]+)', re.UNICODE)
MENTION_PATTERN = re.compile(r'(@{?[\S ]?[^{}@]+[@;]?\s*[\w\-./@]+[\w/]+}?)', re.UNICODE)
def decode_if_bytes(text):
@@ -22,67 +27,26 @@ def encode_if_text(text):
return text
def find_tags(text: str, replacer: callable = None) -> Tuple[Set, str]:
def find_tags(text: str) -> List[str]:
"""Find tags in text.
Tries to ignore tags inside code blocks.
Ignore tags inside code blocks.
Optionally, if passed a "replacer", will also replace the tag word with the result
of the replacer function called with the tag word.
Returns a set of tags.
Returns a set of tags and the original or replaced text.
"""
found_tags = set()
# <br> and <p> tags cause issues in us finding words - add some spacing around them
new_text = text.replace("<br>", " <br> ").replace("<p>", " <p> ").replace("</p>", " </p> ")
lines = new_text.splitlines(keepends=True)
final_lines = []
code_block = False
final_text = None
# Check each line separately
for line in lines:
final_words = []
if line[0:3] == "```":
code_block = not code_block
if line.find("#") == -1 or line[0:4] == " " or code_block:
# Just add the whole line
final_lines.append(line)
continue
# Check each word separately
words = line.split(" ")
for word in words:
if word.find('#') > -1:
candidate = word.strip().strip("([]),.!?:*_%/")
if candidate.find('<') > -1 or candidate.find('>') > -1:
# Strip html
candidate = bleach.clean(word, strip=True)
# Now split with slashes
candidates = candidate.split("/")
to_replace = []
for candidate in candidates:
if candidate.startswith("#"):
candidate = candidate.strip("#")
if test_tag(candidate.lower()):
found_tags.add(candidate.lower())
to_replace.append(candidate)
if replacer:
tag_word = word
try:
for counter, replacee in enumerate(to_replace, 1):
tag_word = tag_word.replace("#%s" % replacee, replacer(replacee))
except Exception:
pass
final_words.append(tag_word)
else:
final_words.append(word)
else:
final_words.append(word)
final_lines.append(" ".join(final_words))
if replacer:
final_text = "".join(final_lines)
if final_text:
final_text = final_text.replace(" <br> ", "<br>").replace(" <p> ", "<p>").replace(" </p> ", "</p>")
return found_tags, final_text or text
tags = find_elements(BeautifulSoup(commonmark(text, ignore_html_blocks=True), 'html.parser'),
TAG_PATTERN)
return sorted([tag.text.lstrip('#').lower() for tag in tags])
def find_elements(soup: BeautifulSoup, pattern: re.Pattern) -> List[NavigableString]:
for candidate in soup.find_all(string=True):
if candidate.parent.name == 'code': continue
ns = [NavigableString(r) for r in re.split(pattern, candidate.text)]
candidate.replace_with(*ns)
return list(soup.find_all(string=pattern))
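A self-contained usage sketch of the `find_elements` helper introduced above; the helper body is reproduced verbatim so the example runs standalone:

```python
# Usage sketch of the find_elements helper above (body reproduced verbatim).
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString

TAG_PATTERN = re.compile(r'(#[\w]+)', re.UNICODE)


def find_elements(soup: BeautifulSoup, pattern: re.Pattern) -> list:
    # Split each text node on the pattern so every match becomes its own
    # NavigableString, skipping text inside <code> elements.
    for candidate in soup.find_all(string=True):
        if candidate.parent.name == 'code':
            continue
        ns = [NavigableString(r) for r in re.split(pattern, candidate.text)]
        candidate.replace_with(*ns)
    return list(soup.find_all(string=pattern))


soup = BeautifulSoup('<p>hello #World, #foo!</p>', 'html.parser')
tags = [str(t) for t in find_elements(soup, TAG_PATTERN)]
# tags now holds the matched hashtag strings in document order
```

After the split, every matched tag or mention is its own text node, which is what lets the new `find_tags` and the `Note` post-processing operate on individual matches.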
def get_path_from_url(url: str) -> str:
@ -100,7 +64,7 @@ def process_text_links(text):
def link_attributes(attrs, new=False):
"""Run standard callbacks except for internal links."""
href_key = (None, "href")
if attrs.get(href_key).startswith("/"):
if attrs.get(href_key, "").startswith("/"):
return attrs
# Run the standard callbacks
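A simplified sketch of the fixed guard: bleach's linkify callbacks receive an `attrs` dict keyed by `(namespace, name)` tuples, so an `a` tag with no `href` has no `(None, "href")` key, and the old `attrs.get(href_key).startswith(...)` raised `AttributeError` on `None`. The real callback runs bleach's standard callbacks; here setting `rel` merely stands in for that step:

```python
# Simplified sketch of the fixed guard in process_text_links. The real
# callback runs bleach's standard callbacks; setting rel here merely stands
# in for that step. Defaulting attrs.get() to "" avoids the AttributeError
# that None.startswith() used to raise for <a> tags without an href.
def link_attributes(attrs: dict, new: bool = False) -> dict:
    href_key = (None, "href")
    if attrs.get(href_key, "").startswith("/"):
        return attrs  # internal link: leave untouched
    attrs[(None, "rel")] = "nofollow"
    return attrs
```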