Topic on Extension talk:WikibaseLexeme/Data Model

An alternative model proposal: logomer

18 comments • 17:02, 5 September 2017

Psychoslave

The more I read and think about it, the more I'm inclined to consider that the model is trying to impose too rigid a framework.

What we are interested in documenting in Wiktionaries is chunks of discourse, and what is claimed about those chunks in such and such theories.

A lexeme is an abstract structure which is already far too committed to a closed theory of language; that is, it leaves no room for presenting language analyses which don't fit a lexemic structuring.

The mission of Wiktionaries is documenting all languages. Side note: this doesn't specify written language, spoken language, or in fact even human language, so depending on local consensus, you might see bee language documented.

What is aimed at here, as far as I understand, is to propose a structured database model to support this mission.

So the model must allow lexemes to be documented, sure. But that could be done as a lexemic relationship. For example, cat and cats, and Baum and Bäumen, are two pairs in lexemic relationships, which could be recorded as 4 distinct entities.

To really support the goal of Wiktionary, the model must also allow documenting w:lexical item, morphs, w:morphemes, etymons and whatever discourse chunk a contributor might want to document and relate to other discourse chunks. A lexeme class can't do that, unless you come up with a definition of lexeme so distant that it won't match any of the already too many existing ones in the linguistic literature.

I'm not aware of any consensual term for the "discourse chunk" in the sense I'm suggesting here (token doesn't fit either). So, in the rest of this message I'll use logomer (see wikt:en:logo- and wikt:en:-mer).

A discourse is any sign flow[note 1].

A glyph is any non-segmentable[note 2] sign that can be stored/recorded.

A logomer is a data structure which pertains to parts of a sequence of glyphs representing a discourse.

A logomer must have one or more representations.

A representation must have one or more forms.

A single form must be elected as label.

A representation should indicate which representational systems it pertains to.[note 3]

A logomer must be related to one or more meanings.[note 4]

A logomer form must be extractable from a glyph sequence that represents a discourse.[note 5]

The extraction process of a logomer form must keep every unfiltered glyph.

The extraction process must not add any glyph.[note 6]

The extraction process must not alter any glyph.

A logomer form must include one or more glyph sequences (hereafter named "segments").

A segment must provide a glyph string.

A form including more than one segment must provide an ordinal for each segment.

A segment ordinal must indicate the position of a segment relative to the other segments of the form, counted from the beginning of the discourses where it appears.

A segment might be void.

A void segment might serve as boundary marker, indicating possible positions for other segments which are not part of the current logomer.

All logomer forms of a single representation must be congruent under permutation.[note 7]

An indistinguishable logomer form might appear in multiple discourses.[note 8]

Distinct occurrences of the same logomer form with distinct meanings must induce distinct logomers.

Distinct meanings attributed to the same discourse parts should appear in a single logomer.

A logomer form might be taken as a discourse of its own.

1. ↑ Further criteria regarding meaning are purposely set aside.
2. ↑ That is, with regard to the sign system used. For example, a code point of a character encoding system could be segmented into several bits, but a bit is not a sign of the encoding system itself, even if a discourse using this system can make references to such a sign.
3. ↑ For example, through statements. Accuracy of this information might be left to the community. It could be things as vague as "casual oral transcription" and "direct matching of written document", or more precise like "phonemic system of the International Phonetic Alphabet" and "official orthography in the Dutch spelling reform of 1996".
4. ↑ Or definition, or whatever indication of its sense.
5. ↑ Discourses that can't be represented as a glyph sequence are not considered.
6. ↑ So boundary markers such as the hyphen in morphs, like logo-, aren't part of a logomer.
7. ↑ That is, all forms have the exact same set of segments; only the ordinals of these segments can change.
8. ↑ But hapaxes are logomer forms too.
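
The rules above can be read as constraints on a small data structure. What follows is only a minimal Python sketch of one possible reading, not a definitive implementation: the class names, the `violations` helper, the choice to attach the label to the logomer (the rules do not say where it lives), the use of an empty string for a void segment, and the gloss used in the example are all assumptions made for illustration. The extraction rules are not modelled here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    ordinal: int   # relative position of the segment within its form
    glyphs: str    # glyph string; "" models a void segment (assumption)


@dataclass
class Form:
    segments: List[Segment]

    def linearize(self) -> str:
        """Concatenate the segments in ordinal order."""
        ordered = sorted(self.segments, key=lambda s: s.ordinal)
        return "".join(s.glyphs for s in ordered)

    def bag(self) -> Counter:
        """Multiset of glyph strings, ignoring ordinals."""
        return Counter(s.glyphs for s in self.segments)


@dataclass
class Representation:
    system: str        # representational system, as free text (assumption)
    forms: List[Form]


@dataclass
class Logomer:
    representations: List[Representation]
    meanings: List[str]
    label: Form        # assumption: one label per logomer


def violations(logomer: Logomer) -> List[str]:
    """List the structural rules from the post that the logomer breaks."""
    problems = []
    if not logomer.representations:
        problems.append("no representation")
    if not logomer.meanings:
        problems.append("no meaning")
    all_forms = [f for r in logomer.representations for f in r.forms]
    if logomer.label not in all_forms:
        problems.append("label is not one of the forms")
    for rep in logomer.representations:
        if not rep.forms:
            problems.append(f"{rep.system}: no form")
            continue
        # Congruence under permutation: same segments, only ordinals differ.
        if any(f.bag() != rep.forms[0].bag() for f in rep.forms):
            problems.append(f"{rep.system}: forms not congruent under permutation")
        for form in rep.forms:
            ordinals = [s.ordinal for s in form.segments]
            if not ordinals:
                problems.append(f"{rep.system}: form without segment")
            elif len(set(ordinals)) != len(ordinals):
                problems.append(f"{rep.system}: duplicate segment ordinal")
    return problems


hard = Form([Segment(0, "hard")])
written = Representation("written English", [hard])
logomer = Logomer([written], ["resistant to pressure"], hard)  # illustrative gloss
print(violations(logomer))   # an empty list means no rule is broken
print(hard.linearize())
```

Comparing multisets of glyph strings is one way to express "congruent under permutation": two forms pass when they carry the same segments and differ only in their ordinals.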

20:55, 30 August 2017

Psychoslave

Actually, it's clearly not yet a fixed model. In fact, I already slimmed it down considerably while creating the following graphical representation:

[Figure: Visualization of an alternative to the Lexeme data model for Wikibase support of Wiktionary]

However, it might be too slim. Maybe keeping at least one mandatory field related to meaning (though one that can hold a null value) would be better, whether on the logomer or on the logomer form.

This way it's possible to indicate a difference between "wikt:fr:grand homme" and "homme grand", the former being (in the French variants I'm aware of) always used to indicate a famous person, while the latter indicates that a person is tall.

But first I'll wait for feedback, especially from Noé, Benoît Prieur, Delarouvraie, Lyokoï, Jberkel, psychoslave, Lydia Pintscher, Thiemo Mättig, Daniel Kinzler, Epantaleo, Ariel1024, Otourly, VIGNERON, Shavtay, TaronjaSatsuma, Rodelar, Marcmiquel, Xenophôn, Jitrixis, Xabier Cañas, Nattes à chat, LaMèreVeille, GastelEtzwane, Rich Farmbrough, Ernest-Mtl, tpt, M0tty, Nemo_bis, Pamputt, Thibaut120094, JackPotte, Trizek, Sebleouf, Kimdime, S The Singer, Amqui, LA2, Satdeep Gill, Micru, Vive la Rosière, Malaysiaboy and Stalinjeet.

21:14, 30 August 2017

LA2

When you float away in your abstractions, you attract dreamers who like such abstractions, but repulse people who are able to sit down and do real and concrete work. Wikipedia is a success not because it is a perfect and abstract ideal of a theoretical model of knowledge, but because it is a simple tool for processing ASCII text.

21:26, 30 August 2017

Lyokoï

Sorry, but I don't understand what you want to do...

22:18, 30 August 2017

Amqui

I am pretty confused about the intent here as well...

22:58, 30 August 2017

Tofeiku

I was surprised that I'm listed here but I'm pretty confused as well with this.

06:08, 31 August 2017

Rich Farmbrough

It is certainly true that we want to document things of unknown or even no meaning, such as "AOI", "Nautron respoc lorni virch" or the archetypal meaningless phrases of philosophers such as "hig mig cig". Even then there is context - there is always context.

Rich Farmbrough 11:45, 31 August 2017 (UTC).

11:45, 31 August 2017

Psychoslave

Ok, it seems I need to explain more what I aim to provide here.

In short: a data structure which aims at carrying less abstract data while allowing the relationships useful for Wiktionaries.

So let's take the English adjective "hard" as a first example, so that one can compare with the examples for the current model.

Example 1: hard

In this model the string (glyph sequence) "hard" might be recorded as follows:

Logomer: L64723

Statements

Label: hard (Form 1)
- (that is, the linearization of the segments, which here is a single element)
used in discourse expressed in: English (Q1860)
lexical category: adjective (Q34698)
Derived from: heard (L112), Old English adjective
other statements might also add registers, glosses, definitions, synonyms, antonyms, translations, related concepts and so on

Form 1

segments: hard

- segments in detail: (0, hard)

Statements

used within representation system of: written utterances (Q98123723736661235)
pronounced as: hɑːd (L387483) (the logomer itself can indicate corresponding representation systems)
- Qualifiers:
  - Region: Scotland (Q22)
- References: ...
pronounced as: hɑɹd
- Qualifiers:
  - Region: Scotland (Q22)
- References: ...
pronounced as: hard.ogg
- Qualifiers:
  - Region: United States of America (Q30)

(Rhymes should be inferred from associated phonetic logomers, which is even more important in cases with regional differences)

Form 2

There is no other indisputable form for hard in this model. But one might suggest that hard- in hardcore is just another form of the present logomer. As said, that's disputable, but for the sake of the example, here is how this second affixal form would be represented in this model (so possibly in a distinct logomer):

segments: "hard", "-"

- segments in detail: (0, "hard"), (1, AGGLUTINATIVE_MARK)
  - The AGGLUTINATIVE_MARK might be a special value, or a string containing a single soft hyphen, for example.

Statements

…

Example 2: je me mis la tête à l’envers

Now, here is a second example which doesn't come from those provided for the Lexeme model, but which might shed light on what I had in mind while trying to outline a design for logomers.

So, in French, "je me mis la tête à l’envers" is an inflected form of the phrase "fr:se mettre la tête à l’envers". In the logomer model, each inflection has its own separate instance. That is, "je me mis la tête à l’envers", "tu te mis la tête à l’envers", and "se mettre la tête à l’envers" are three different logomers. Possibly they could group common statements in another entity, but that's another topic.

Forms in logomers are here only to carry permutations and related statements such as grammatical acceptability in a given frame.

For example "je me mis la tête complètement à l’envers", "je me mis gravement la tête à l’envers" and "à l’envers, je me mis la tête" are all less commonly heard but grammatically acceptable to my native French mind, and clearly use instances of "je me mis la tête à l’envers".

Thus "je me mis gravement la tête à l’envers" might be described as the following form:

segments: "je me mis", " ", "la tête", " ", "à l’envers"
- segments in detail: (0, "je me mis"), (1, SPECIFIER_MARK), (2, "la tête"), (3, SPECIFIER_MARK), (4, "à l’envers")
  - The SPECIFIER_MARK might be a concept entity such as "adjective", linearized as a simple space or "[…]" for display purposes.

And "à l’envers, je me mis la tête" might be described as the following form:

segments: "à l’envers", " ", "je me mis", " ", "la tête"
- segments in detail: (0, SPECIFIER_MARK), (1, "à l’envers"), (2, "je me mis"), (3, SPECIFIER_MARK), (4, "la tête")

Note that something like "me je tête l’envers mis la à", which certainly wouldn't be recognized as grammatical by a French speaker, doesn't fit any permutation of the segments proposed here, but nothing in the model prevents documenting it in another logomer.
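
To make the matching idea concrete, here is a minimal Python sketch of how a form with marks could be tested against a discourse. It rests on several assumptions that are not part of the proposal: a mark is encoded as `None`, it is treated as a slot accepting any run of glyphs (not only specifiers), literal segments must appear unaltered and in order, and the helper names `form_pattern`, `matches` and `congruent` are invented for illustration. For the second form, the sketch follows the plain `segments:` line above, which puts the first mark after "à l’envers"; the detailed listing places it differently.

```python
import re
from collections import Counter
from typing import List, Optional

SPECIFIER_MARK = None  # hypothetical encoding of a void segment


def form_pattern(segments: List[Optional[str]]) -> "re.Pattern":
    """Literal segments must appear as-is; each mark accepts any glyphs."""
    parts = [".*" if seg is SPECIFIER_MARK else re.escape(seg) for seg in segments]
    return re.compile("".join(parts), re.DOTALL)


def matches(segments: List[Optional[str]], discourse: str) -> bool:
    """True when the whole discourse is covered by the form's segments."""
    return form_pattern(segments).fullmatch(discourse) is not None


def congruent(form_a: List[Optional[str]], form_b: List[Optional[str]]) -> bool:
    """Same segments (marks included), possibly in a different order."""
    return Counter(form_a) == Counter(form_b)


FORM_A = ["je me mis", SPECIFIER_MARK, "la tête", SPECIFIER_MARK, "à l’envers"]
FORM_B = ["à l’envers", SPECIFIER_MARK, "je me mis", SPECIFIER_MARK, "la tête"]

for discourse in (
    "je me mis gravement la tête à l’envers",
    "à l’envers, je me mis la tête",
    "me je tête l’envers mis la à",
):
    print(discourse, matches(FORM_A, discourse), matches(FORM_B, discourse))

print(congruent(FORM_A, FORM_B))
```

Under these assumptions the two forms are permutations of each other, each accepts its own example sentence, and the scrambled string fits neither.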

I hope it helps @LA2:, @Lyokoï:, @Amqui: and @Malaysiaboy: to grasp my approach.

08:25, 1 September 2017

Lyokoï

Sorry for the French [translated]:

Wait, it's the Wikidata version of Wiktionary that you're trying to build here, isn't it? Look, I have never taken the time to do anything there. I don't feel like putting time into it, and in any case I think it is not the right solution. Please leave me out of this. I will get involved when I see some benefit in it for the Wiktionnaire.

Edited 15:14, 1 September 2017

Lyokoï

(I'll add that, on top of that, it's in English, and I only half understand it...)

15:13, 1 September 2017

Denny

@Psychoslave, thanks for the effort in trying to create a better model. I want to point out that the current proposal for Wikidata's Lexicographic model is not just thought up by the Wikidata team, but an adaptation of lexicographic data models that have been developed over the last century starting with TC 37, later under ISO as the Lexical Markup Framework, and then captured in RDF under the Lemon model. Wikidata is very much in that tradition, which means it is distilling literally the knowledge of hundreds of linguists over a century of work.

Just to raise three points with your model:

1) whereas you claim that it also extends to Bee language, I am wondering whether this is actually a requirement. Wikidata's first (although not only) priority is to support its sister Wikimedia projects. Is there any Wiktionary that actually captures bee language? If we move too far away from our requirements we might create a solution that is more complex than necessary.

2) whereas you claim that the Bee language is a requirement, your model later is restricted to languages represented with glyphs. This seems contradictory to me? Did I miss something?

3) in your example for hard, you state that meanings and antonyms could be expressed via statements on the level of the Logomer. But antonyms are not pertaining to a specific Logomer, if I understand Logomers and antonyms correctly, but usually to a specific sense of the Logomer, i.e. to a specific definition. But I don't seem to be able to express the antonym relation on the definition. Maybe I am just missing something.

Again, thank you for your proposal. It is unfortunate that it comes so late - the discussions about the data model were held years ago, fully in the open, and with wide invitations. It is not easy to fully appreciate such a contribution just a few months before the planned rollout of the project.

22:06, 1 September 2017

Psychoslave

0) I read the whole talk page of Wikidata4Wiktionary, so I was aware of the important analysis work you have done and used. I haven't yet read all the documentation about Lemon, though. Anyway, my concern is not about the Lemon model, or the currently proposed Lexeme model, as a useful tool in many contexts, but about its use in the very precise context of Wikidata4Wiktionary. If tradition seems a good fit for grounding our goals, great, let's leverage it in our work. Otherwise, let's set it aside, rather than sink under the weight of its unsuitable hypotheses.

1) If that's the case, I'm not aware of it. The bee language was of course an extreme example. I'm all for a simpler model: one which removes as much as possible of any linguistic theory while retaining the ability to express such theories through its internal mechanisms. My current proposal still seems far too complicated and confusing for other contributors, so to my mind it is not good enough either. Sticking to our requirements is great, but what are our requirements? I haven't seen a document clearly setting out these requirements and how they were produced, so if such a document exists, please let me know. To my mind, requirement 0 is a class designed to store strings, going from d to wikt:bayerischer gebirgsschweisshund, but also including affixes such as wikt:-in-, morphs, and any sequence of characters one might encounter in the world. I tried to go further with the "ordered segments" of utterances, but maybe that's already too complex a model for our goals. Then requirement 1 is to be able to document those strings, so that those who encounter them can type them into a Wiktionary and discover where they are suspected to come from, and whether they might mean something, its contrary, or nothing at all depending on context. Yes, even strings with no actual (publicly known) meaning are worth documenting, so that people who encounter them can grasp the knowledge of this substantiated absence of sense. And finally, requirement 2 is to be able to glue all these pieces together through relationships, preferably in a way that allows as much automated inference as possible. Those are the basic requirements I would expect from a Wikidata4Wiktionary.

2) I think it more probable that I didn't explain my idea clearly enough than that you missed something I said distinctly. So my idea is that the data model is not about an utterance performance, but about a recordable representation of such a performance. The representation only refers to the performance. Maybe a theater analogy would be more telling here: there is a written script before the show and you might have a recorded video, but the performance itself is gone. So, do I think that a glyph sequence can be used to represent a bee utterance? Yes, definitely, just as w:fr:DanceWriting can be used to represent dance. I used glyph rather than character, because – at least to my mind – glyphs represent a larger set. But if "character strings" is clearer, let's use that.

3) I think you have a very good point, but I'm afraid that as I'm writing this I'm far too tired to provide a relevant answer right now. So I'll delay until I have had some rest, sorry.

4) Well, I'm sorry, I do agree I'm late. I did attempt to participate in the past, but never found the occasion to give more feedback earlier, all the more so as I was expanding my knowledge of linguistics and contributing in various other ways as a Wikimedian…

Edited by Denny 19:05, 2 September 2017

Denny

0) The use case and requirements are to support the Wiktionaries. So the coverage is kinda given by "whatever the Wiktionaries have now", and the model has to be good enough to cover that. Going beyond that is nice and well, but only if it doesn't get more complicated. As simple as possible, as complex as required to serve the Wiktionaries - that's the primary requirement. If at the same time we can follow best practices from research - just as we did for Wikidata and the research in Knowledge Representation - all the better - that would be the secondary requirement. So if there is a widely agreed-on data model from linguistic research which at the same time fulfills the needs of the Wiktionaries, then I am super happy to just adopt it instead of inventing something new. Because in this case the likelihood of third parties donating data or working with the data grows by a huge amount, since we are not inventing new stuff but building on existing stuff that is already established. This is why I think an alternative model doesn't only have to be strictly better, but strictly better by a margin wide enough to justify jeopardizing external adoption. I hope that makes any sense.

Basically, I would ask anyone who brings up an alternative model to show what exact use case in Wiktionary would not be served by the current proposed model and how their model serves it - and at the same time ensuring that all other use cases are still covered.

3) I'd be curious to hear, as I think that is one of the main use cases the data model has to fulfill.

(I'm skipping 1), 2) and 4), as I think they are not so central and won't contribute too much to a result. Let me know if you disagree)

19:05, 2 September 2017

Psychoslave

I'm ok with skipping 1), 2) and 4).

Regarding 3), I think that you are simply right about the flaw of the Logomer model.

I'm still wondering what the Lexeme class of the current model is supposed to encompass. Should it store affixes, stems, etymons, clitics, morphemes (and possibly monemes), glossemes, compounds, lexical items, phrases, and other lexical units which don't even have an English equivalent such as wikt:fr:lexie? If so, I wonder if the term lexeme is still appropriate.

13:57, 3 September 2017

Psychoslave

Concerning requirements and examples of data that the model should be able to encompass, I will write a dedicated page. Maybe this week, but I'll have to allocate more time to local chapter concerns, so I can't promise any progress on this side in the coming days.

21:32, 3 September 2017

Denny

I don't care that much about what the structures are named in the data model, and I wouldn't put too much weight on a definition of Lexeme - just as we never clearly defined what an Item is in Wikidata. In the end, everything that has a Wiktionary page will lead to one or more Lexemes, just as everything with a page in the other Wikimedia projects leads to Items. The important thing is, whether the structure works as a data structure - not what a Lexeme is. The word 'Lexeme' is merely a rough approximation, to convey a rough idea. 'Word' would have been equally possible, and inaccurate too - but in the end, it is just a rough, somewhat intuitive word for a data structure that needs to fulfill the requirements of the use cases.

21:22, 4 September 2017

Psychoslave

"I don't care that much about what the structures are named in the data model": Well, it's very sad to hear you are careless about terminology, especially within a project that is aimed at helping lexicographers. If the model keeps this data structure, then it definitely should use "word" instead of "lexeme".

"just as we never clearly defined what an Item is in Wikidata": Isn't the Wikidata glossary entry about item a clear definition? Maybe it was done as an afterthought, but it's there as far as I can see.

"The important thing is, whether the structure works as a data structure - not what a Lexeme is.": The important thing is whether the structure is helpful for Wiktionary contributors, and using clearly defined classes is a requirement of such a goal. Otherwise this model could just as well use "class1" instead of "lexeme", "class2" instead of "form", "attribute1" instead of "lemma", and so on. As a data structure per se it would work just as well.

"'Word' would have been equally possible, and inaccurate too - but in the end, it is just a rough, somewhat intuitive word for a data structure that needs to fulfill the requirements of the use cases.": Linguists use "lexeme" precisely to avoid the vagueness of "word" (although depending on the linguistic school it will carry different specific meanings). Using "lexeme" this way is counter-intuitive, or at least in complete opposition to the intent of the term. It gives the false impression that the model intends to treat the topic with carefully chosen terminology, when in fact the term was chosen carelessly and arbitrarily through unspecified criteria. Also, it seems very inconsistent that on the one hand you say that the model should be built on the solid body of linguistic knowledge founded over the last century, and on the other hand that you just don't care about using appropriate terminology with regard to this same body of knowledge.

Edited 09:08, 5 September 2017

Denny

These are good points, and in hindsight, what I wrote sounds more dismissive than I meant it to be. Yes, Item has a definition that you point to - but if you really look at Wikidata you will find that this definition is not true. There are plenty of Items which are far from fulfilling that definition. And yet it is, I think, a good thing to have such a definition, as it helps with understanding the model. It's a Wittgenstein's ladder.

The same I would hold for the terminology here. In fact, I do think that the model should work as well if we used attribute1 instead of lemma. But the latter is helpful in discussing the UI, the code, the model. Not because it is true.

The data model must fulfill the use cases, and if it is able to model solid linguistic theories, all the better. But the exact terminology should be treated as a Wittgenstein's ladder - useful to gain an intuition, and for having evocative labels, but it should not (and won't) restrain the community in getting their work done. If something - like the letter 'd' - is not regarded as a Lexeme in mainstream linguistic theories, that should not (and won't) stop the community from adding the letter 'd' as a Lexeme - just as the ontological status of many, say, Wikinews articles or Wikisource pages did not stop the community from creating items for them. And that's OK.

In the end, the exact labeling of the elements of the data structure won't be as important as what the community actually does with them and how they use it inside the Wiktionaries. In fact, a lot of the terminology is even hidden from most casual contributors - they might never see the term 'Lexeme' in the first place. Just as the word 'item' is not prominent in the Wikidata UI. But it is still useful to have a shared vocabulary for development and talking about the system.

I hope that makes sense and I am not contradicting myself too much.

17:02, 5 September 2017