Wikifunctions is a new site that has been added to the list of projects operated by the WMF. I definitely see uses for it in automating updates on Wikipedia and in bots (and as a reference for programmers), but its stated goal is to translate Wikipedia articles into more languages by writing them in code that carries a lot of linguistic information. I have mixed feelings about this: I don’t like existing programs that automatically generate articles (see the Cebuano and Dutch Wikipedias), and I worry that the system will be too complicated for average people.

  • GenderNeutralBro@lemmy.sdf.org
    7 months ago

    Sounds like a great idea. Plain English (or any human language) is not the best way to store information. I’ve certainly noticed mismatches between the data in different languages, or across related articles, because they don’t share the same data source.

    Take a look at the article for NYC in English and French and you’ll see a bunch of data points, like total area, that are different. Not huge differences, but any difference at all is enough to demonstrate the problem. There should be one canonical source of data shared by all representations.

    Wikipedia is available in hundreds of languages. Why should hundreds of editors need to update the NYC page every time a new census comes out with new population numbers? Ideally, that would require only one change to update every version of the article.

    In programming, the convention is to separate the data from the presentation. In this context, plain English is the presentation, and weaving actual data into it is sub-optimal. Something like the population or area of a city is not language-dependent, and should not be stored in a language-dependent way.

    Ultimately, this is about reducing duplicate effort and maintaining data integrity.
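    The data/presentation split described above can be sketched in a few lines. This is a hypothetical illustration, not how Wikipedia or Wikidata actually stores anything; the figures, template strings, and function names are all invented:

```python
# Hypothetical sketch: one canonical data store, many language "presentations".
# Values and templates are invented for illustration.

CANONICAL = {"nyc_population": 8_804_190, "nyc_area_km2": 1_223.59}

TEMPLATES = {
    "en": "New York City has a population of {nyc_population:,}.",
    # Real French number formatting uses spaces, not commas; a real system
    # would localize numerals per language.
    "fr": "New York compte {nyc_population:,} habitants.",
}

def render(lang: str) -> str:
    # Every language reads the same canonical source, so a new census figure
    # is a single update that propagates to every version of the article.
    return TEMPLATES[lang].format(**CANONICAL)

print(render("en"))
print(render("fr"))
```

    Updating `CANONICAL` once would change every rendered language at the same time, which is the point being made here.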

    • schnurrito@discuss.tchncs.de
      7 months ago

      This problem was essentially solved back in 2012 with the introduction of Wikidata, but not all language editions have decided to use it.

      • GenderNeutralBro@lemmy.sdf.org
        7 months ago

        How common is it in English? I haven’t checked a lot of articles, but I did check the source of the English and French NYC articles I linked and it seems like all the information is hardcoded, not referenced from Wikidata.

      • rottingleaf@lemmy.zip
        7 months ago

        but not all language editions have decided to use that.

        Some people like the little bit of power, which they call “meritocracy”, of deciding what belongs in the article and what doesn’t.

    • robotica@lemmy.world
      7 months ago

      Disclaimer: I didn’t do any research on this, but what would be wrong with just having an AI translate the text, given a reliable enough AI? No code required, just plain human speech.

      • GenderNeutralBro@lemmy.sdf.org
        7 months ago

        This will help make machine translation more reliable, ensuring that objective data does not get transformed along with the language presenting that data. It will also make it easier to test and validate the machine translators.

        Any automated translations would still need to be reviewed. I don’t think we will (or should) see totally automated translations in the near future, but I do think machine translators could be a very useful tool for editors.

        Language models are impressive, but they are not efficient data retrieval systems. Denny Vrandecic, the founder of Wikidata, has a couple insightful videos about this topic.

        This one talks about knowledge graphs in general, from 2020: https://www.youtube.com/watch?v=Oips1aW738Q

        This one is from last year and is specifically about how you could integrate LLMs with the knowledge graph to greatly increase their accuracy, utility, and efficiency: https://www.youtube.com/watch?v=WqYBx2gB6vA

        I highly recommend that second video. He does a great job laying out what LLMs are efficient for, what more conventional methods are efficient for, and how you can integrate them to get the best of both worlds.

  • AbouBenAdhem@lemmy.world
    7 months ago

    I assume the main benefit will be for users of less-spoken languages, who currently get out-of-date articles or none at all.

  • Lvxferre@mander.xyz
    7 months ago

    but their goal is to translate Wikipedia articles to more languages by writing them in code that has a lot of linguistic information

    That’ll get unruly really fast.

    Languages simply don’t agree on how to split the usage of words. Or grammatical case. Or if, when and how to do agreement.

    Just for the sake of example: how are they going to keep track of case in a way that doesn’t break Hindi, or Basque, or English, or Guarani? Or grammatical gender for a word like “milk”? (Not even the Romance languages agree on it.) At a certain point, it gets simply easier to write the article in all those languages than to code something to generate it for you.


    I think the best use scenario is automating tidbits of rapidly changing data. It’s fairly limited, but it could be useful.

    • Atemu@lemmy.ml
      7 months ago

      Languages simply don’t agree on how to split the usage of words. Or grammatical case. Or if, when and how to do agreement.

      Just for the sake of example: how are they going to keep track of case in a way that doesn’t break Hindi, or Basque, or English, or Guarani? Or grammatical gender for a word like “milk”? (Not even the Romance languages agree on it.) At a certain point, it gets simply easier to write the article in all those languages than to code something to generate it for you.

      I don’t know what the WMF is planning here but what you’re pointing out is precisely what abstraction would solve.

      If you had an abstract way to represent a sentence, you would be independent of any one order or case or whatever other grammatical feature. In the end you obviously do need actual sentences with these features. To get those, you’d build a mechanism that converts the abstract sentence representation into concrete sentences for specific languages, each correctly constructed according to that language’s rules.

      Same with gender. What you’d store would not be that e.g. some german sentence is talking about the feminine milk but rather that it’s talking about the abstract concept of milk. How exactly that abstract concept is represented in words would then be up to individual languages to decide.

      I have absolutely no idea whether what I’m talking about here would be practical to implement, but in theory it could work.
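      A minimal toy sketch of that abstract-representation idea (this is not the actual Abstract Wikipedia design; the lexicon, gender tags, and renderers below are all invented):

```python
# Toy sketch: a language-neutral "abstract sentence" plus per-language
# renderers that supply gender, word choice, and agreement. Invented design.

ABSTRACT = {"subject": "MILK", "predicate": "IS_WHITE"}  # language-neutral

LEXICON = {
    "en": {"MILK": ("milk", None)},   # English nouns carry no gender
    "es": {"MILK": ("leche", "f")},   # feminine in Spanish
    "de": {"MILK": ("Milch", "f")},   # feminine in German
}

def render(sentence: dict, lang: str) -> str:
    word, gender = LEXICON[lang][sentence["subject"]]
    if lang == "en":
        return f"The {word} is white."
    if lang == "es":
        # Article and adjective must agree with the noun's gender.
        art, adj = ("La", "blanca") if gender == "f" else ("El", "blanco")
        return f"{art} {word} es {adj}."
    if lang == "de":
        art = {"f": "Die", "m": "Der", "n": "Das"}[gender]
        return f"{art} {word} ist weiß."
    raise ValueError(f"no renderer for {lang}")

print(render(ABSTRACT, "en"))
print(render(ABSTRACT, "es"))
print(render(ABSTRACT, "de"))
```

      The gender of “milk” lives in each language’s lexicon entry rather than in the abstract sentence, which is exactly the separation being argued for.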

      • Lvxferre@mander.xyz
        7 months ago

        Abstractions are not magic, and they cannot make info appear out of nowhere. Somewhere inside that abstraction you’ll need to have the pieces of info that Spanish “leche” [milk] is feminine, that Zulu “ubisi” [milk] is class 11, that the English predicative uses the ACC form, and so on.

        And you’ll need people to mark a multitude of distinctions in their sentences, when writing them down, that the abstraction layer would demand for other languages. Such as tagging the “I” in “I see a boy” as “+masculine, +older-person, +informal” so Japanese correctly conveys it as “ore” instead of “boku”, “atashi”, “watashi”, etc.
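        A toy illustration of that tagging burden (the feature names are invented for this sketch):

```python
# Toy sketch: picking a Japanese first-person pronoun requires features
# ("+masculine", "+informal", ...) that an English "I" never carries, so the
# abstract text would have to supply them. Feature names are invented.

def japanese_first_person(features: set) -> str:
    if "masculine" in features and "informal" in features:
        return "ore" if "older-person" in features else "boku"
    return "watashi"  # neutral/polite default

# An English writer just types "I"; the abstract version must be tagged:
print(japanese_first_person({"masculine", "informal", "older-person"}))  # ore
print(japanese_first_person({"masculine", "informal"}))                  # boku
print(japanese_first_person(set()))                                      # watashi
```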

        Even the idea of an “abstract concept of milk” doesn’t work as well as it sounds, because languages split even the abstract concepts in different ways. For example, does the abstract concept associated with a living pig include its flesh?

        And the language itself cannot decide those things. A language is not an agent; it doesn’t “do” something. You’d need people to actively insert those pieces of info for each language, that’s perhaps doable for the most spoken ones, but those are the ones that would benefit the least from this.

        • Atemu@lemmy.ml
          7 months ago

          Somewhere inside that abstraction you’ll need to have the pieces of info that Spanish “leche” [milk] is feminine, that Zulu “ubisi” [milk] is class 11, that the English predicative uses the ACC form, and so on.

          Of course you do. The beauty of abstraction is that these language-specific parts can be factored into generic language-specific components. The information you’re actually trying to convey can be denoted without any language-specific parts or exceptions and that’s the important part for Wikipedia’s purpose of knowledge preservation and presentation.

          you’ll need people to mark a multitude of distinctions in their sentences, when writing them down, that the abstraction layer would demand for other languages. Such as tagging the “I” in “I see a boy” as “+masculine, +older-person, +informal” so Japanese correctly conveys it as “ore” instead of “boku”, “atashi”, “watashi”, etc.

          For writing a story or prose, I agree.

          For the purpose of writing Wikipedia articles, this specifically and explicitly does not matter very much. Wikipedia strives to have one unified way of writing within a language. Whether the “I” is masculine or not would be a parameter that would be applied to all text equally (assuming I-narrator was the standard on Wikipedia).

          Even the idea of “abstract concept of milk” doesn’t work as well as it sounds like, because languages will split even the abstract concepts in different ways. For example, does the abstract concept associated with a living pig includes its flesh?

          If your article talks about the concept of a living pig in some way, and in the context of that article it doesn’t matter whether the flesh is included, then you simply use the default word/phrase that the language uses to convey the concept of a pig.

          If it did matter, you’d explicitly describe the concept of “a living pig with its flesh” instead of the more generic concept of a living pig. If that happened to be the default of the target language or the target language didn’t differentiate between the two concepts, both concepts would turn into the same terms in that specific language.

          The same applies to your example of the different forms of “I” in Japanese. To create an appropriate Japanese “rendering” of an abstract sentence, you’d use the abstract concept of “a nerdy shy kid refers to itself” as, say, the subject. The Japanese language “renderer” would turn that into a sentence like ”僕は。。。” while the English “renderer” would simply produce “I …”.

          A language is not an agent; it doesn’t “do” something. You’d need people to actively insert those pieces of info for each language, that’s perhaps doable for the most spoken ones, but those are the ones that would benefit the least from this.

          Yes, of course they would have to do that. The cool thing is that it’d only have to be done once, in a generic manner, and from that point on you could use that definition to “render” any abstract article into any language you like.

          You must also keep in mind that this effort has to be measured relative to the alternatives. In this case, the alternative is to translate each and every article and all changes done to them into every available language. At the scale of Wikipedia, that is not an easy task and it’s been made clear that that’s simply not happening.

          (Okay, another alternative would be to remain on the status quo with its divergent versions of what are supposed to be the same articles containing the same information.)

          • Lvxferre@mander.xyz
            7 months ago

            Note: I’ll clip the quotes for succinctness.

            Of course you do. […]

            You can’t leave those things to the abstraction layer, because how different languages map abstract concepts differs, so there’s no way to factor them into generic language-specific components. The writer will need to tag things down to minimal details, for the sake of languages that they don’t care about. It ends up like that story about a map so accurate that it’s as big as the terrain it represents, and thus useless.

            For writing a story or prose, I agree. […]

            As I said in the reply to the other poster, the first-person pronoun is just an example. This issue affects languages as a whole, and sometimes in ways that you can’t arbitrate through a fixed writing style, because they convey meaning. (For example: if you don’t encode social gender into the third-person pronouns, English breaks.)

            If your article talks about the concept of a living pig in some way and in the context of that article, it doesn’t matter whether the flesh is included, then you simply use the default word/phrase that the language uses to convey the concept of a pig. […]

            Often there’s no such thing as a “default”. The example with pig/pork is one of those cases: if whoever is writing the article doesn’t account for the fact that English uses two concepts (pig vs. pork) for what Spanish covers with one (cerdo = puerco, etc.), and assumes the default (“pig”), you’ll end up with stuff like *“pig consumption has increased” (intended: “pork consumption has increased”). And the abstraction layer has no way to know whether the human is talking about some living animal or its flesh.

            And context doesn’t help much because pork and pigs are mentioned often in the same articles.
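            The pig/pork split can be shown concretely with a toy lexicon (concept names invented for this sketch):

```python
# Toy sketch of a lexicalization split: English uses two words where Spanish
# uses one, so a coarse concept tag loses information. Names are invented.

LEXICON = {
    "en": {"PIG_ANIMAL": "pig", "PIG_MEAT": "pork"},
    "es": {"PIG_ANIMAL": "cerdo", "PIG_MEAT": "cerdo"},  # one word covers both
}

def word_for(concept: str, lang: str) -> str:
    return LEXICON[lang][concept]

# A Spanish writer never has to choose between the two concepts, so nothing
# forces them to tag the distinction that English needs:
print(word_for("PIG_ANIMAL", "es"))  # cerdo
print(word_for("PIG_MEAT", "es"))    # cerdo
print(word_for("PIG_MEAT", "en"))    # pork (correct only if the writer tagged it)
```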

            If it did matter, you’d explicitly describe the concept of “a living pig with its flesh” instead of the more generic concept of a living pig.

            As I said at the top, you’ll end up with a “map” that is as large as the “terrain”, and thus useless. (Or: spending way more effort explicitly describing all the concepts than it would take to simply translate the text by hand.)


            The project isn’t useless, mind you. Perhaps unsurprisingly, it could be usable for small things in highly controlled situations, like tables; OP themself hinted at this usage.

            But as much as I avoid doing “hard” statements about future tech, I’m fairly certain that it won’t be viable as a way to write full articles in a language-agnostic way.

        • flavia@lemmy.blahaj.zoneOP
          7 months ago

          This is an encyclopedia, so there are no pronouns like “I”, which simplifies the issue. The remaining ones are in the third person, and if we link them to data about the person being referred to, that would solve it. A linguist doesn’t necessarily need to know a language in order to analyze its grammar, and a lot of the work needed in Wikifunctions is like this.

          • Lvxferre@mander.xyz
            7 months ago

            This is an encyclopedia, so there are no pronouns like “I”, which simplifies the issue. The remaining ones are in the third person, and if we link them to data about the person being referred to, that would solve it.

            The pronoun is an example. You are confusing the example with the issue.

            The issue is that, if some language out there marks a distinction, whoever writes the abstract version of the text will need to mark it too, as that info won’t “magically” pop out of nowhere. The issue won’t appear just in the pronouns, but everywhere.

            A linguist doesn’t necessarily need to know a language in order to analyze its grammar, and a lot of the work needed in Wikifunctions is like this.

            Usually when you aren’t proficient in a language but still describing it, you focus on a single aspect of its grammar (for example, “unergative verbs”) and either a single variety or a group of related ones.

            What the abstract version of the text would require is nowhere close to that. It’s more like demanding that the linguist output a full grammar, to usable levels, of every language found on Wikipedia, just to write down a text about some asteroid, using a notation that is cross-linguistically consistent and comprehensible.

            Also note that descriptions coming from linguists who are not proficient in a variety in question tend to be poorer.

    • Jojo@lemm.ee
      7 months ago

      They’re just going to write all the articles in Lojban.

      • Lvxferre@mander.xyz
        7 months ago

        Not even that would do the trick - practical usage of Lojban heavily relies on fu’ivla, which carry with them the semantic scope assigned to the original words. .u’i I want to see them trying, though.

    • Lvxferre@mander.xyz
      7 months ago

      I’ll reply to myself to highlight a point, and issue a challenge for those who assume that WMF’s apparent goal - to translate Wikipedia articles to more languages by writing them in code that has a lot of linguistic information - is actually viable:

      Here’s an excerpt from an actual Wikipedia article: “the solubility of these gases depending on the temperature and salinity of the water.” Show me all the linguistic information that a writer would need to input, to convey the same information, in that system idealised by the goal, in a way that would not output “then who was phone?” tier nonsense for some languages. Then I’ll show you why it would still output nonsense for some languages.

      Too much work? Then feel free to do it just for “of the water”. It’s a single PP, how hard would it be? /s

      Hic Rhodus, hic salta.

      [Edit reason: clarity.]

  • abhibeckert@lemmy.world
    7 months ago

    Your description doesn’t seem to match what the site does? For example, the front page has a function that converts uppercase text to lowercase.

    It’s not article content - it’s an interactive utility.

    • flavia@lemmy.blahaj.zoneOP
      7 months ago

      The site itself is for contributors who want to create functions and write the code for them. Examples of how it might be used for articles in the future:

      • Z11884 for articles about chemicals.
      • Z11302 for use in prose.