overview for Martineski

1

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/hmmm@lemmy.fmhy.ml

1 comments fedilink

7

A new sublemmy from the "hmmm" category was just started! Check out hmmmTexts and subscribe to it if you like the content! (lemmy.fmhy.ml)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/hmmm@lemmy.fmhy.ml

1 comments fedilink

Link to the sublemmy: !hmmmtexts@lemmy.fmhy.ml

The other sublemmy from the hmmm category that I didn't make an announcement for: !hmmmgifs@lemmy.fmhy.ml

GPT-4 details leaked (Leak from ~10.07.2023) in c/singularity@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 3 points 1 year ago* (last edited 1 year ago)

Article (behind a paywall) on the leak: https://www.semianalysis.com/p/gpt-4-architecture-infrastructure

14

GPT-4 details leaked (Leak from ~10.07.2023) (archive.is)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

2 comments fedilink

I just copy/pasted what's in the link so formatting may be broken:

GPT-4's details are leaked.

It is over.

Everything is here: twitter.com/i/web/status/1…

Parameters count:

GPT-4 is more than 10x the size of GPT-3. We believe it has a total of ~1.8 trillion parameters across 120 layers.

Mixture Of Experts - Confirmed.

OpenAI was able to keep costs reasonable by utilizing a mixture of experts (MoE) model. They utilize 16 experts within their model, each with about ~111B parameters for MLP. 2 of these experts are routed to per forward pass.

MoE Routing:

While the literature talks a lot about advanced routing algorithms for choosing which experts to route each token to, OpenAI’s is allegedly quite simple, for the current GPT-4 model.
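
The parameter figures quoted around this point in the post (16 experts of ~111B each, 2 routed per forward pass, ~55B shared attention parameters) can be cross-checked with simple arithmetic. The sketch below only uses the post's approximate numbers; the "2 FLOPs per parameter per token" rule of thumb is an assumption added here, not something the post states.

```python
# Cross-check of the leaked parameter figures. All inputs are the post's
# approximate numbers, not confirmed values.
NUM_EXPERTS = 16
PARAMS_PER_EXPERT = 111 * 10**9        # ~111B MLP parameters per expert
EXPERTS_PER_TOKEN = 2                  # experts routed to per forward pass
SHARED_ATTENTION_PARAMS = 55 * 10**9   # ~55B shared attention parameters

# Total parameters stored vs. parameters actually used for one token.
total_params = NUM_EXPERTS * PARAMS_PER_EXPERT + SHARED_ATTENTION_PARAMS
active_params = EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT + SHARED_ATTENTION_PARAMS

# Assumption (not from the post): a forward pass costs roughly
# 2 FLOPs per parameter per token.
FLOPS_PER_PARAM = 2
active_flops = FLOPS_PER_PARAM * active_params
dense_flops = FLOPS_PER_PARAM * total_params

print(f"total parameters:  {total_params / 1e12:.2f} trillion")
print(f"active parameters: {active_params / 1e9:.0f} billion")
print(f"active / total:    {active_params / total_params:.1%}")
print(f"FLOPs per token (MoE):   {active_flops:.3e}")
print(f"FLOPs per token (dense): {dense_flops:.3e}")
```

The totals land at about 1.83 trillion stored and 277 billion active parameters, which is consistent with the ~1.8T and ~280B quoted in the post. Under the 2-FLOPs-per-parameter assumption the per-token compute comes out near 5.5e11 and 3.7e12 FLOPs, so the digits match the post's ~560 and ~3,700 even though the unit prefix does not obviously match.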

There are roughly ~55B shared parameters for attention.

Inference:

Each forward pass inference (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOP that would be required per forward pass of a purely dense model.

Dataset:

GPT-4 is trained on ~13T tokens.

These are not unique tokens; repeated epochs are counted as additional tokens as well.

Epoch number: 2 epochs for text-based data and 4 for code-based data.
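
Since the ~13T figure counts repeated epochs, the number of unique tokens depends on how the total splits between text and code, which the post does not give. The sketch below brackets the answer; `code_fraction` is a hypothetical parameter introduced only for illustration.

```python
# Bounds on unique training tokens, given that the ~13T total counts epochs.
TOTAL_COUNTED_TOKENS = 13e12   # ~13T tokens, as quoted in the post
TEXT_EPOCHS = 2                # epochs over text-based data
CODE_EPOCHS = 4                # epochs over code-based data


def unique_tokens(code_fraction: float) -> float:
    """Unique tokens if `code_fraction` of the counted tokens are code.

    `code_fraction` is hypothetical; the post does not state the split.
    """
    if not 0.0 <= code_fraction <= 1.0:
        raise ValueError("code_fraction must be between 0 and 1")
    text_counted = TOTAL_COUNTED_TOKENS * (1.0 - code_fraction)
    code_counted = TOTAL_COUNTED_TOKENS * code_fraction
    return text_counted / TEXT_EPOCHS + code_counted / CODE_EPOCHS


upper_bound = unique_tokens(0.0)   # everything is text seen twice
lower_bound = unique_tokens(1.0)   # everything is code seen four times
print(f"unique tokens between {lower_bound / 1e12:.2f}T and {upper_bound / 1e12:.2f}T")
```

Whatever the real mix was, these two extremes put the unique token count somewhere between roughly 3.25T and 6.5T.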

There are millions of rows of instruction fine-tuning data from ScaleAI & internally.

GPT-4 32K

There was an 8k context length (seqlen) for the pre-training phase. The 32k seqlen version of GPT-4 is based on fine-tuning of the 8k after the pre-training.

Batch Size:

The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million! This, of course, is “only” a batch size of 7.5 million tokens per expert due to not every expert seeing all tokens.

For the real batch size: divide this number by the seq len to get the real batch size. Just stop with these misleading numbers already.

Parallelism Strategies

To parallelize across all their A100 GPUs, they utilized 8-way tensor parallelism, as that is the limit for NVLink.

Beyond that, they are using 15-way pipeline parallelism.
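
To make the layout concrete, the tensor and pipeline degrees can be multiplied out against the layer and GPU counts quoted elsewhere in the post (120 layers, ~25,000 A100s). This is a back-of-the-envelope sketch: it assumes layers are split evenly across pipeline stages and that every GPU belongs to an identical model replica, neither of which the post confirms.

```python
# Back-of-the-envelope training layout from the post's figures.
TENSOR_PARALLEL = 8      # 8-way tensor parallelism (NVLink limit)
PIPELINE_PARALLEL = 15   # 15-way pipeline parallelism
NUM_LAYERS = 120         # layer count quoted in the post
TOTAL_GPUS = 25_000      # ~25,000 A100s quoted in the post

# GPUs needed to hold one full copy of the model.
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL

# Assumption: layers are divided evenly across pipeline stages.
layers_per_stage = NUM_LAYERS // PIPELINE_PARALLEL

# Assumption: all GPUs are grouped into identical data-parallel replicas.
data_parallel_replicas, leftover_gpus = divmod(TOTAL_GPUS, gpus_per_replica)

print(f"GPUs per model replica: {gpus_per_replica}")
print(f"layers per pipeline stage: {layers_per_stage}")
print(f"data-parallel replicas: {data_parallel_replicas} (leftover GPUs: {leftover_gpus})")
```

Under these assumptions one replica spans 120 GPUs with 8 layers per stage, and the cluster would hold about 208 replicas.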

(They likely used ZeRO Stage 1. It is possible they used block-level FSDP.)

Training Cost

OpenAI’s training compute for GPT-4 was ~2.15e25 FLOPs, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU.

Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints. If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.
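
The compute and cost figures can be roughly reproduced from the GPU count, duration, and MFU quoted above. The sketch assumes a peak of 312 TFLOPS per A100 (NVIDIA's published dense BF16/FP16 tensor-core figure); the post does not say which precision or peak number its MFU refers to, so treat that constant as an assumption. The function names are illustrative.

```python
# Rough reproduction of the training compute and cost estimates.
SECONDS_PER_DAY = 86_400
A100_PEAK_FLOPS = 312e12   # assumption: dense BF16/FP16 peak per A100


def training_flops(num_gpus: int, days: float, mfu: float,
                   peak_flops: float = A100_PEAK_FLOPS) -> float:
    """Total useful FLOPs = GPUs x peak rate x utilization x wall-clock seconds."""
    return num_gpus * peak_flops * mfu * days * SECONDS_PER_DAY


def rental_cost(num_gpus: int, days: float, usd_per_gpu_hour: float) -> float:
    """Cloud cost = GPU-hours x hourly price."""
    return num_gpus * days * 24 * usd_per_gpu_hour


# Low and high ends of the ranges given in the post.
low = training_flops(25_000, days=90, mfu=0.32)
high = training_flops(25_000, days=100, mfu=0.36)
print(f"training compute: {low:.2e} to {high:.2e} FLOPs")

print(f"A100 cost, 90 days at $1/h:  ${rental_cost(25_000, 90, 1.0):,.0f}")
print(f"A100 cost, 100 days at $1/h: ${rental_cost(25_000, 100, 1.0):,.0f}")

# The H100 scenario mentioned right after this paragraph in the post.
print(f"H100 cost, 8,192 GPUs, 55 days at $2/h: ${rental_cost(8_192, 55, 2.0):,.0f}")
```

With these inputs the quoted ~2.15e25 FLOPs falls inside the computed range, and the H100 scenario comes out close to the quoted $21.5 million. The 90-to-100-day A100 range gives $54 to $60 million at $1 per hour, somewhat below the ~$63 million in the post; the source does not say what accounts for the difference.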

(Today, the pre-training could be done with ~8,192 H100s in ~55 days for $21.5 million at $2 per H100 hour.)

Mixture of Expert Tradeoffs

There are multiple MoE tradeoffs taken: For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation. This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.

Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that’s purely research. There are multiple reasons to go with fewer experts. One reason for OpenAI choosing 16 experts is that more experts have a harder time generalizing across many tasks. More experts can also make convergence more difficult to achieve. With such a large training run, OpenAI instead chose to be more conservative on the number of experts.

GPT-4 Inference Cost

GPT-4 costs 3x that of the 175B parameter Davinci. This is largely due to the larger clusters required for GPT-4 and the much lower utilization achieved. An estimate of its cost is $0.0049 cents per 1k tokens for 128 A100s to inference GPT-4 8k seqlen and $0.0021 cents per 1k tokens for 128 H100s to inference GPT-4 8k seqlen. It should be noted that we assume decently high utilization and that batch sizes are kept high.

Multi-Query Attention

OpenAI is using MQA just like everybody else. Because of that, only 1 head is needed and memory capacity can be significantly reduced for the KV cache. Even then, the 32k seqlen GPT-4 definitely cannot run on 40GB A100s, and the 8k is capped on max bsz.

Continuous batching

OpenAI implements both variable batch sizes and continuous batching. This is so as to allow some level of maximum latency as well as optimizing the inference costs.

Vision Multi-Modal

It is a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Flamingo. This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens, after the text-only pre-training. On the vision model, OpenAI wanted to train it from scratch, but it wasn’t mature enough, so they wanted to derisk it by starting with text. One of the primary purposes of this vision capability is for autonomous agents able to read web pages and transcribe what’s in images and video. Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: sampling frames and running Whisper on them to get transcripts.

[Don't want to say "I told you so" but...]

Speculative Decoding

OpenAI might be using speculative decoding on GPT-4's inference. (not sure 100%)
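
As a rough illustration of the draft-and-verify loop that speculative decoding refers to, here is a toy greedy version. Everything in it is hypothetical: `draft` and `target` are stand-in functions rather than real models, acceptance is by exact match with the large model's greedy choice (production systems typically use a probabilistic acceptance rule), and nothing here reflects OpenAI's actual implementation.

```python
from typing import Callable, List

Token = int
# A "model" here is just a function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Draft k tokens, keep the ones the target agrees with, then add one target token."""
    # 1) The small model guesses k tokens ahead, one after another.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2) The large model checks each guess. A real system would score all k
    #    positions in a single batched forward pass; this loop is for clarity.
    accepted: List[Token] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            # First disagreement: take the large model's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3) Every guess was accepted, so the large model adds one more token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


# Toy models: the target counts upward modulo 10; the draft agrees except after a 2.
def target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10


print(generate([0], draft, target, k=4, max_new=8))
```

With exact-match acceptance the output is identical to what the large model would produce on its own; the draft model only changes how many large-model calls are needed. Loosening the acceptance rule is what would trade quality for speed.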

The idea is to use a smaller, faster model to decode several tokens in advance, and then feed them into a large oracle model as a single batch. If the small model was right about its predictions, the larger model agrees and we can decode several tokens in a single batch. But if the larger model rejects the tokens predicted by the draft model, then the rest of the batch is discarded, and we continue with the larger model.

The conspiracy theory that the new GPT-4 quality has deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.

Inference Architecture

The inference runs on a cluster of 128 GPUs.

There are multiple of these clusters in multiple datacenters in different locations.

It is done in 8-way tensor parallelism and 16-way pipeline parallelism.

Each node of 8 GPUs has only ~130B parameters, or… twitter.com/i/web/status/1…

The model has 120 layers, so it fits in 15 different nodes. [Possibly there are fewer layers on the first node since it also needs to compute the embeddings.]

According to these numbers: OpenAI should have trained on 2x the tokens if they were trying to go by Chinchilla's optimal.

[let alone surpass it like we do]

This goes to show that they are struggling to get high quality data.

Why no FSDP?

A possible reason for this could be that some of the hardware infra they secured is of an older generation.

This is pretty common at local compute clusters, as the organisation usually upgrades the infra in several "waves" to avoid a complete pause of operation.… twitter.com/i/web/status/1…

Dataset Mixture

They trained on 13T tokens. CommonCrawl & RefinedWeb are both 5T.

Remove the duplication of tokens from multiple epochs and we get to a much more reasonable number of "unaccounted for" tokens: the "secret" data. By this point there are already rumors that parts of it came from Twitter, Reddit & YouTube.

[Rumors that start to become lawsuits]

Some speculations are:

LibGen (4M+ books)

Sci-Hub (80M+ papers)

All of GitHub

My own opinion:

The missing dataset is a custom dataset of college textbooks collected by hand for as many courses as possible.

This is very easy to convert to a txt file and then, with self-instruct, into instruction form. This creates the "illusion" that GPT-4 "is smart" no matter who uses it.

Computer scientist? Sure! It can help you with your questions about P!=NP.

Philosophy major? It can totally talk to you about epistemology.

Don't you see? It was trained on the textbooks. It is so obvious.

There are also papers that try to extract by force memorized parts of books from GPT-4 to understand what it was trained on.

There are some books it knows so well that it had seen them for sure.

Moreover, if I remember correctly, it even knows the unique IDs of Project Euler exercises.

23

A.I. Health scans are going to become the Norm (Article from 11.07.2023) (aisupremacy.substack.com)

submitted 1 year ago* (last edited 1 year ago) by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

3 comments fedilink

4

Large Language Models as General Pattern Machines (paper from 10.07.2023) (general-pattern-machines.github.io)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

0 comments fedilink

Abstract:

We observe that pre-trained large language models (LLMs) are capable of autoregressively completing complex token sequences -- from arbitrary ones procedurally generated by probabilistic context-free grammars (PCFG), to more rich spatial patterns found in the Abstract Reasoning Corpus (ARC), a general AI benchmark, prompted in the style of ASCII art. Surprisingly, pattern completion proficiency can be partially retained even when the sequences are expressed using tokens randomly sampled from the vocabulary. These results suggest that without any additional training, LLMs can serve as general sequence modelers, driven by in-context learning. In this work, we investigate how these zero-shot capabilities may be applied to problems in robotics -- from extrapolating sequences of numbers that represent states over time to complete simple motions, to least-to-most prompting of reward-conditioned trajectories that can discover and represent closed-loop policies (e.g., a stabilizing controller for CartPole). While difficult to deploy today for real systems due to latency, context size limitations, and compute costs, the approach of using LLMs to drive low-level control may provide an exciting glimpse into how the patterns among words could be transferred to actions.

5

China's Baichuan Intelligent Technology Unveils Open-Source 13B Parameter Large Language Model (Article from 11.07.2023) (www.maginative.com)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

0 comments fedilink

4

Introducing AnthropicAI's Claude 2 (Announcement from 11.07.2023) (www.anthropic.com)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

4 comments fedilink

6

Liquid Neural Networks, A New Idea That Allows AI To Learn Even After Training (video from 9.07.2023) (youtu.be)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

0 comments fedilink

Introducing Llama 2 - Meta's Next-Generation Commercially Viable Open-Source AI & LLM (paper from 18.07.2023) in c/singularity@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 2 points 1 year ago

Please add the date of the source to the title as per our rule 6. Copy this:

(paper from 18.07.2023)

Thank you.

hmmm in c/hmmm@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 1 points 1 year ago

And he damn does!

Password reset broken on lemmy.fmhy.ml in c/freemediaheckyeah@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 3 points 1 year ago

Naw, this platform is still very buggy so I doubt that it's your fault.

Solved - Unable to create a community... in c/general_discussion@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 3 points 1 year ago (1 children)

The "name" is the permanent name of the sublemmy; it can only contain lowercase characters and no spaces. "Displayed name" is the one where you can use capitalised characters and other stuff.

Password reset broken on lemmy.fmhy.ml in c/freemediaheckyeah@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 3 points 1 year ago (2 children)

Wow, that's crazy weird.

LG to offer subscriptions for already purchased appliances and televisions, evolving into a provider for “Home as a Service” in c/righttorepair@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 2 points 1 year ago

You can make a post about this discord server to get more attention if you want.

I can't change my community banner anymore in c/freemediaheckyeah@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 2 points 1 year ago

I get the same error, idk why :x

We just hit 810 subscribers and 35 posts 🔥✊. If you are a lurker please help us grow the community by commenting and posting. ✌️ in c/antiwork@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 2 points 1 year ago

Only on lemmy.fmhy.ml

The Disappearing Computer: An Exclusive Preview of Humane’s Screenless Tech | Imran Chaudhri | TED in c/singularity@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 1 points 1 year ago* (last edited 1 year ago)

Please add the date of the video to the title of the post as per our rule 6. Copy this:

(video from 9.05.2023)

(Edit: recruitment is closed) I'm looking for 3-5 moderators for this sub! in c/antiwork@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 1 points 1 year ago

It's alright.

Solved - Unable to create a community... in c/general_discussion@lemmy.fmhy.ml

[–] Martineski@lemmy.fmhy.ml 2 points 1 year ago

Boost

17

Google DeepMind, OpenAI, and Leading Academics Propose International Institutions for Global AI Governance (article from 12.07.2023) (www.maginative.com)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

3 comments fedilink

8

AI Agents With ‘Multiple Selves’ Learn to Adapt Quickly in a Changing World (article from 11.07.2023) (singularityhub.com)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

0 comments fedilink

Significance

Adaptive agents must continually satisfy a range of distinct and possibly conflicting needs. In most models of learning, a monolithic agent tries to maximize one value that measures how well it balances its needs. However, this task is difficult when the world is changing and needs are many. Here, we considered an agent as a collection of modules, each dedicated to a particular need and competing for control of action. Compared to the standard monolithic approach, modular agents were much better at maintaining homeostasis of a set of internal variables in simulated environments, both static and changing. These results suggest that having “multiple selves” may represent an evolved solution to the universal problem of balancing multiple needs in changing environments.

Abstract

Satisfying a variety of conflicting needs in a changing environment is a fundamental challenge for any adaptive agent. Here, we show that designing an agent in a modular fashion as a collection of subagents, each dedicated to a separate need, powerfully enhanced the agent’s capacity to satisfy its overall needs. We used the formalism of deep reinforcement learning to investigate a biologically relevant multiobjective task: continually maintaining homeostasis of a set of physiologic variables. We then conducted simulations in a variety of environments and compared how modular agents performed relative to standard monolithic agents (i.e., agents that aimed to satisfy all needs in an integrated manner using a single aggregate measure of success). Simulations revealed that modular agents a) exhibited a form of exploration that was intrinsic and emergent rather than extrinsically imposed; b) were robust to changes in nonstationary environments, and c) scaled gracefully in their ability to maintain homeostasis as the number of conflicting objectives increased. Supporting analysis suggested that the robustness to changing environments and increasing numbers of needs were due to intrinsic exploration and efficiency of representation afforded by the modular architecture. These results suggest that the normative principles by which agents have adapted to complex changing environments may also explain why humans have long been described as consisting of “multiple selves.”

3

PIKA LABS: Future of AI Text to Video Creation by pika art. (video from 12.07.2023) (youtu.be)

submitted 1 year ago* (last edited 1 year ago) by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

2 comments fedilink

PIKA LABS site: https://www.pika.art/demo

14

A Hair Loss Study Raises New Questions About Aging Cells (article from 12.07.2023) (www.wired.com)

submitted 1 year ago by Martineski@lemmy.fmhy.ml to c/singularity@lemmy.fmhy.ml

0 comments fedilink

A protein secreted by seemingly dormant cells in skin moles causes hair to grow again. That’s a big—and potentially useful—surprise.