Akisamb

joined 1 year ago
[–] Akisamb@programming.dev 2 points 7 months ago (1 children)

Does this deductible also have to be paid by the poorest?

[–] Akisamb@programming.dev 3 points 7 months ago

This is not true in France. Politicians who are shown to have committed fraud are arrested and charged. In France, Sarkozy, Cahuzac and Fillon were all charged with crimes.

They were president, minister and presidential candidate respectively. I'd be surprised if it was different in the USA. I'm seeing that Trump is also being charged, so the system seems to be working.

[–] Akisamb@programming.dev 3 points 8 months ago

Convolutional neural networks and plant-identifying apps came before ChatGPT. Beyond both relying on neural networks, they don't have much in common.

[–] Akisamb@programming.dev 1 points 8 months ago

Don't know why you are downvoted, it's a good question.

As a matter of fact, it almost happened for search engines in France. Newspapers argued that snippets were leading people not to visit their ad-infested sites, thus losing them revenue.

https://techcrunch.com/2020/04/09/frances-competition-watchdog-orders-google-to-pay-for-news-reuse/

[–] Akisamb@programming.dev 4 points 8 months ago

It does seem odd that scraping activity from just two accounts allegedly managed to cause such an extended server outage. The irony of this situation also hasn’t been lost on online creatives, who have extensively criticized both companies (and generative AI systems in general) for training their models on masses of online data scraped from their works without consent. Stable Diffusion and Midjourney have both been targeted with several copyright lawsuits, with the latter being accused of creating an artist database for training purposes in December.

As far as I know they do not have copyright over the output of their models. Apart from banning the users they pretty much have no solutions to stop this. Even if they had copyright, it's still legally unknown if training LLMs constitutes a copyright violation.

In a similar fashion, a lot of the recent chat LLMs have been trained on output from ChatGPT. After all, why pay humans to produce training data when your competitor has already done it for you?

[–] Akisamb@programming.dev 1 points 8 months ago (2 children)

Why would Java have an impact on battery performance? Pretty much all credit cards run Java for their encryption algorithms, and they need pretty much no power to run.

[–] Akisamb@programming.dev -4 points 9 months ago

I don't agree. Curvy roads are dangerous, but there are many more conflicts in cities. You're not going to have many pedestrians on curvy mountain roads.

That said, you are right that the ideal comparison would be in the same city. But I'm not sure that the data exists, I'll have to look this afternoon.

Also, even if my data is not perfect, it's much better than taking one accident and saying that self-driving cars are dangerous. They are not going to be magically better than humans; after all, driving is a difficult task. But we should at least crunch the numbers before dismissing them.

[–] Akisamb@programming.dev -3 points 9 months ago (2 children)

You can't take one accident and use that to generalize.

You need to take into account all accidents and see how much worse humans are.

https://arstechnica.com/cars/2023/12/human-drivers-crash-a-lot-more-than-waymos-software-data-shows/

Cars are naturally dangerous. A robot car is going to cause deaths no matter what. That does not mean they are bad if they lead to fewer cars and fewer accidents. Taxis, if done properly, can help a public transport system.

[–] Akisamb@programming.dev 15 points 9 months ago

They gave them a birth control shot without properly informing them of what it was. Still scandalous, but not what you are saying.

[–] Akisamb@programming.dev 3 points 9 months ago

These models do not see letters but tokens. For the model, "violet" is probably two symbols, "viol" and "et". Apart from learning by heart the number of letters in each token, it is impossible for the model to know the number of letters in a word.

This is also why the GPT family sucks at addition: their tokenizer has symbols for common numbers like 14. This means that to do 14 + 1 the model cannot use the knowledge that 4 + 1 is 5, as it cannot see the link between the token 4 and the token 14. The Llama tokenizer fixes this by splitting numbers into individual digits, and is thus much better at basic arithmetic even with much smaller models.
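
A rough sketch of the idea, using a toy greedy longest-match tokenizer. Both vocabularies here are made up for illustration, and real GPT and Llama tokenizers use byte-pair encoding rather than this simplified matching, so treat it as a picture of the problem and not as how either tokenizer actually works:

```python
# Toy greedy longest-match tokenizer. The vocabularies below are hypothetical;
# they are not the real GPT or Llama vocabularies.

def tokenize(text, vocab):
    """Split text into the longest vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Hypothetical vocabulary with a dedicated symbol for the number 14.
merged_vocab = {"viol", "et", "14", "1", "4", "5"}
# Hypothetical vocabulary that only knows single digits.
digit_vocab = {"viol", "et", "1", "4", "5"}

word_tokens = tokenize("violet", merged_vocab)
merged_number = tokenize("14", merged_vocab)
digit_number = tokenize("14", digit_vocab)

print(word_tokens)    # ['viol', 'et']
print(merged_number)  # ['14']
print(digit_number)   # ['1', '4']
```

With the first vocabulary the model receives one opaque symbol for 14 and never sees a 4 in it. With the digit-level vocabulary the 4 is a token of its own, so whatever the model learned about 4 + 1 can carry over.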

[–] Akisamb@programming.dev 5 points 10 months ago (1 children)

Do you do the picking on your own, or do you hire help?

[–] Akisamb@programming.dev 3 points 10 months ago

Yes to your question, but that's not what I was saying.

Here is one of the most popular training datasets: https://pile.eleuther.ai/

If you look at the PDF describing the dataset, you'll find that these documents are fairly short, with a mean length of less than 20 KB (about 20,000 characters) for most documents.

You are asking for a model to retain a memory for the whole duration of a discussion, which can be very long. If I chat for one hour, I'll type approximately 8,400 words, or around 42 KB. That is longer than most documents in the training set. If I chat for 20 hours, it'll be longer than almost all the documents in the training set. The model needs to learn how to extract information from a long context, and it can't do that well if the documents it was trained on are short.

You are also right that during training the text is cut off. A value I often see is 2k to 8k tokens. This is arbitrary; some models are trained with a cutoff of 200k tokens. You can use models on context lengths longer than what they were trained on (with some caveats), but performance falls off badly.
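
A back-of-the-envelope check of those numbers. The 5 characters per word is an assumption, chosen because it is what turns 8400 words into roughly 42 KB; the constant names are only for illustration:

```python
WORDS_PER_HOUR = 8400      # figure from the comment above
CHARS_PER_WORD = 5         # assumption: implied by 8400 words being about 42 KB
MEAN_DOC_CHARS = 20_000    # upper bound on the mean document length quoted above

def chat_chars(hours):
    """Approximate number of characters typed in a chat of the given length."""
    return hours * WORDS_PER_HOUR * CHARS_PER_WORD

one_hour = chat_chars(1)
twenty_hours = chat_chars(20)

print(one_hour)                       # 42000
print(twenty_hours)                   # 840000
print(one_hour / MEAN_DOC_CHARS)      # 2.1
print(twenty_hours / MEAN_DOC_CHARS)  # 42.0
```

So under these assumptions a one-hour chat is already about twice the quoted mean, and a 20-hour chat is about 42 times it.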
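
As a minimal sketch of what "cut off" can mean, here is one simple way to split a tokenized document into pieces of at most a given length. The function name and the tiny numbers are hypothetical, and real training pipelines may pack, pad or drop the remainder differently:

```python
def cut_to_context(tokens, max_len):
    """Split a token sequence into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# Stand-in for a tokenized document: 20 integer token ids (hypothetical).
document = list(range(20))
chunks = cut_to_context(document, max_len=8)

print([len(c) for c in chunks])  # [8, 8, 4]
```

Each chunk is seen on its own, so the model never gets a training example where information from the first chunk has to be used in the third.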
