[freamon] Currently working on: cross-posts

Andrew@piefed.social · 8 months ago

Rimu@piefed.social · edit-2 8 months ago

I’m glad you posted here first; a PR of this impact deserves some discussion. Also, anything involving non-trivial database changes can be quite difficult to reverse once live data gets involved, so we need to be a bit more careful.

Community table

URLs are guaranteed to be unique, whereas titles of posts are typed by people, so we run a decent risk of falsely detecting a cross-post. Also, a title like “Why are you interested in this?” means something different depending on the community it is posted in. We could work around this by limiting the search space for title-based detection to posts within the last few days, and only for titles that are fairly long, to increase the chance of being unique?
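A minimal sketch of that title heuristic, in case it helps the discussion. The thresholds (40 characters, 3 days) and the function names are assumptions made up for illustration; nothing here is PieFed code.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the thread does not settle on exact values.
MIN_TITLE_LENGTH = 40
MAX_AGE = timedelta(days=3)


def normalise_title(title: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def is_title_cross_post(new_title: str, old_title: str,
                        old_posted_at: datetime, now: datetime) -> bool:
    """Treat two posts as cross-posts only if the title is long and the older post is recent."""
    if len(normalise_title(new_title)) < MIN_TITLE_LENGTH:
        return False  # short titles are too likely to collide by chance
    if now - old_posted_at > MAX_AGE:
        return False  # outside the search window
    return normalise_title(new_title) == normalise_title(old_title)
```

A short title such as “Why?” is rejected before any comparison happens, which is the point of the length limit.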

Also I wonder about the potential for abuse & trolling.

Actually, even for URLs, perhaps we need to only check for duplicates within the last few days. When people link to the home page of a site it can be for a variety of different reasons, but if the posts are recent then they’re probably linking for the same reason.
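As a rough sketch of the recent-only URL check: PostStub is a hypothetical stand-in for a post row (not PieFed’s Post model), and the 3-day window is an assumed value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PostStub:
    """Hypothetical stand-in for a post row; not PieFed's Post model."""
    id: int
    url: str
    posted_at: datetime


def recent_url_matches(new_post, existing, window=timedelta(days=3)):
    """Return ids of earlier posts sharing new_post's URL, posted within `window` before it."""
    cutoff = new_post.posted_at - window
    return [
        p.id for p in existing
        if p.id != new_post.id
        and p.url == new_post.url
        and cutoff <= p.posted_at <= new_post.posted_at
    ]
```

In a real query the same cutoff would presumably become a filter on the post’s creation timestamp rather than a Python loop.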

If we only use the URL then we don’t need xp_indicator.

Posts

I did not know PostgreSQL could do arrays; that’s very interesting.

I’m not concerned about being locked in to PostgreSQL, as I’m making zero effort to test PieFed on other database systems, so we are probably already locked in, accidentally. I know the full text search package requires PostgreSQL, for example.

However, while I can see the appeal of array fields, I’d really prefer we use a normal DB table for the cross_posts data. It seems a lot easier to query and do joins on? I’d tend to use array fields for storing lists of data rather than IDs which act as foreign keys. https://stackoverflow.com/questions/58943211/am-i-breaking-2nf-rule-for-using-array-data-type-in-postgressql
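For comparison, here is roughly what the normal-table option could look like. This sketch uses SQLite from the Python standard library so it runs anywhere, not PostgreSQL, and the table and column names (cross_post, post_id, other_id) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Hypothetical link table: one row per direction of a cross-post pair.
    CREATE TABLE cross_post (
        post_id  INTEGER NOT NULL REFERENCES post(id),
        other_id INTEGER NOT NULL REFERENCES post(id),
        PRIMARY KEY (post_id, other_id)
    );
""")
conn.executemany("INSERT INTO post (id, title) VALUES (?, ?)",
                 [(1, "Original"), (2, "Copy in community B"), (3, "Unrelated")])
conn.executemany("INSERT INTO cross_post (post_id, other_id) VALUES (?, ?)",
                 [(1, 2), (2, 1)])


def cross_posts_of(post_id: int) -> list:
    """Fetch the cross-posts of one post with a single join."""
    rows = conn.execute(
        "SELECT p.id, p.title FROM cross_post AS x "
        "JOIN post AS p ON p.id = x.other_id "
        "WHERE x.post_id = ? ORDER BY p.id",
        (post_id,),
    )
    return rows.fetchall()
```

The trade-off is an extra table and a join on every lookup, in exchange for foreign-key constraints and ordinary relational queries.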

Andrew@piefed.social · 8 months ago

Oh, okay. I was only thinking of using ‘title’ for very few communities, like AskLemmy or ShowerThoughts, but I see how it could produce false positives even for those (I may also have been misled by the recent Issue into thinking title-based cross-posts happen more often than they do).

Speaking of that Issue, maybe the search for URL-based cross-posts could also happen in Redis - would be quicker, and would only be for recent stuff (depending on the expiry for how recent, of course).
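A rough sketch of that idea, with a plain in-memory dict standing in for Redis so it runs without a server. The class and method names are hypothetical; with real Redis the expiry would presumably come from key expiry rather than the manual timestamp check used here.

```python
import time


class RecentUrlIndex:
    """In-memory stand-in for a Redis-style cache: URL -> post ids, with an expiry."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries = {}  # url -> (expires_at, [post ids])

    def add(self, url, post_id, now=None):
        """Record post_id under url and return ids already seen for it (the cross-posts)."""
        now = time.time() if now is None else now
        expires_at, ids = self._entries.get(url, (0.0, []))
        if expires_at <= now:
            ids = []  # entry expired: older posts no longer count
        earlier = list(ids)
        # Each new post refreshes the expiry for this URL.
        self._entries[url] = (now + self.ttl, ids + [post_id])
        return earlier
```

The expiry decides how recent “recent” is: a post with the same URL arriving after the entry has expired starts a fresh group instead of being linked to the old ones.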

Anyway, I’ll share here how I eventually got DB arrays to work, in case anyone considers it for anything else:

from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.ext.mutable import MutableList
...
cross_posts = db.Column(MutableList.as_mutable(ARRAY(db.Integer)))

(they need to be mutable because, otherwise, SQLAlchemy won’t notice in-place changes such as append(), so the DB won’t be updated when they’re added to)

Fetching them is this code (called when the ‘layers’ icon is clicked):

@bp.route('/post/<int:post_id>/cross_posts', methods=['GET'])
def post_cross_posts(post_id: int):
    post = Post.query.get_or_404(post_id)
    # cross_posts is NULL for posts with no cross-posts, so fall back to an empty list
    cross_posts = Post.query.filter(Post.id.in_(post.cross_posts or [])).all()
    return render_template('post/post_cross_posts.html', post=post, cross_posts=cross_posts)

This isn’t as bad as the case in that Stack Overflow post, because it’s not joining those values with another table. The values in the array are sort-of self-references rather than foreign keys, I think, so I assumed it’d be quicker than using another table (which would then refer back to the Post table again).

Rimu@piefed.social · 8 months ago

Oh, well, if we can use Post.id.in_(), that’s quite elegant! That goes a long way to mollifying my concerns. Let’s do it!

Andrew@piefed.social · 8 months ago

Okay. I’ll nix the xp_indicator idea (which’ll also make the code clearer), and keep plodding on.