Documentation for Mighty Inference Server

Guides

Installation and Quick Starts

Welcome to the Mighty manual! Getting started is easy: just download and extract the application, then start a server with a default configuration and model.

Step 1. Download

Mighty is a small executable with minimal dependencies (also small). It does not need to be installed; it simply runs from the location you extract it to. The only requirements are curl and hwloc.

#Install dependencies
sudo apt-get install curl hwloc libhwloc-dev

#Download the application
curl -O http://max.io/mighty-linux.tar.gz
tar -zxf mighty-linux.tar.gz
cd mighty

Step 2. Start Mighty

Start the server using the `mighty` executable in the extracted directory. Without any arguments, this starts a server that serves the embeddings endpoint using the default model.

./mighty
>Mighty server for embeddings is listening on http://localhost:5050

Step 3. Inference!

With the server running, you can make a request from another terminal or browser to http://localhost:5050/?text=Hello+Mighty. to get an inference response, providing some text in the querystring. In this example you will retrieve the embeddings for the phrase `Hello Mighty.`

curl http://localhost:5050/\?text=Hello+Mighty.
>{
>  "took":12,
>  "text":"Hello mighty.",
>  "shape":[5,384],
>  "outputs":[[0.066911593079,0.162673592567,...,-0.233405888080]]
>}

The server returns how long the request took, the text you provided, the output embeddings, and the shape of the embeddings. "Hello Mighty." has 5 tokens, and the embedding dimension is 384 for the default model, which gives a shape of `[5,384]`. The outputs and this other information can be readily used in your application by making a simple HTTP request to your Mighty server!
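
For example, if you have `jq` installed, you can pull just the embeddings out of the response on the command line (a quick sketch; any HTTP client in your language of choice works the same way):

#Request embeddings and extract only the "outputs" array with jq
curl -s http://localhost:5050/\?text=Hello+Mighty. | jq '.outputs'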

Step 4. Stopping the server

You can always stop the server with `Ctrl-C` in the terminal where you started it. From another terminal, you can execute `pkill mighty` to kill the process - don't worry, it's safe! Mighty is stateless and can't lose data.

It is common to run mighty as a background process by appending an `&` to the command: `./mighty &`. This will print the startup information and then stay in the background, ready to be stopped at any time with `pkill mighty`.
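
For example (a minimal sketch of the two commands together):

#Start mighty in the background
./mighty &

#...and stop it later from the same or another terminal
pkill mighty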

How to use Mighty for your model

In the introduction, we started mighty without any command-line arguments. You can easily specify command-line args to use a different pipeline or model. Since each pipeline has a default model configured, you can specify just the pipeline without declaring a model.

Mighty currently supports embeddings, question-answering, sentence-transformers, sequence-classification, and token-classification models. These are equivalent to the pipelines in the Python transformers library.

Here's how to start Mighty using a different pipeline with its default model, and how to request from the endpoint:

./mighty --embeddings
curl localhost:5050/?text=Hello+to+Mighty+on+this+fine+day.

./mighty --question-answering
curl 'localhost:5050/question-answering?question=When+was+mighty+launched%3F&context=In+the+year+2022.'

./mighty --sentence-transformers
curl localhost:5050/sentence-transformers/?text=Hello+to+Mighty+on+this+fine+day.

./mighty --sequence-classification
curl localhost:5050/sequence-classification/?text=Hello+to+Mighty+on+this+fine+day.

./mighty --token-classification
curl localhost:5050/token-classification/?text=Mighty+was+made+in+Rochester,+NY.

If you know the name of a model that is compatible with Mighty, then you can specify it for the specific pipeline. For example:

./mighty --embeddings --model distilroberta-base

or

./mighty --sentence-transformers --model sentence-transformers/msmarco-distilbert-dot-v5

Note: the pipeline and the model are tightly coupled. If you request a pipeline that the model does not support, you will get an HTTP 409 Conflict error at request time. Note that in any case the root embeddings endpoint `/?text=...` is always available for any model.
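
As an illustration (the exact response body may differ), requesting a pipeline that the running model was not converted for returns a 409:

#Server started with ./mighty --embeddings; question-answering is not available
curl -i 'http://localhost:5050/question-answering?question=When+was+mighty+launched%3F&context=In+the+year+2022.'
>HTTP/1.1 409 Conflict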

Converting your own model

To use Mighty with your model, the model needs to be converted to ONNX format, and have a valid tokenizer and config. Compatible models can be easily converted using the tool at https://github.com/maxdotio/mighty-convert

A model is compatible if it is supported by the Hugging Face transformers onnx Python module; the tokenizer and config still need to be normalized to Mighty's requirements, which the mighty-convert tool conveniently does.

Here's how to install the mighty-convert tool. Python 3.8+ and pip are required. It is recommended to use this in a new venv or conda environment.

git clone https://github.com/binarymax/mighty-convert
cd mighty-convert
pip install -r requirements.txt 

Once installed, you can specify a model and an optional pipeline to download and convert from Huggingface Hub!

#Convert a model from the Hugging Face Hub: the model name, then the pipeline
./mighty-convert.py \
  sentence-transformers/msmarco-distilbert-dot-v5 \
  sentence-transformers

To convert a model that is already on your own machine, just specify the path to the folder!

./mighty-convert.py ~/models/my_model/ sequence-classification

This will convert and save your model into the `output` directory in mighty-convert.

If a pipeline is not specified, only embeddings are available when the model is hosted by Mighty Inference Server.

Using your converted model with Mighty

With a newly converted model on the same disk, using it is easy!

./mighty --model ~/mighty-convert/output/my_model --sequence-classification

You can now copy this model folder anywhere you need for use with Mighty.

Production hosting and best practices

While running mighty on a local machine is handy for development, its real power is in a production environment. Mighty was designed to serve inference with the lowest latency and highest throughput possible. It is a small executable Rust application built to run fast and lean, and to maximize the utility of the hardware it runs on. The executable is less than 14MB in size, and uses the ONNX Runtime library, which is itself also less than 14MB. In production, Mighty depends on curl, nginx, and hwloc.

  1. A Mighty server is meant to be bound to a CPU core (we'll get to GPU later). If your machine or instance has 32 cores, you can easily start 32 separate Mighty server processes, with each mighty process bound to its own core and memory!
  2. Mighty contains a web server as part of its process, but is meant to be load-balanced to each core by an Nginx reverse proxy.
  3. Doing both of the above steps is super easy. A small script, `mighty-cluster`, is included in the Mighty distribution; it launches enough Mighty processes to fill the server's available cores.
  4. Mighty is stateless, and Mighty servers do not need to communicate with one another - drastically lowering complexity.

Here's how to deploy from scratch on an Ubuntu server:

#Install server dependencies
sudo apt-get install curl nginx hwloc libhwloc-dev

#Download and start a Mighty cluster
curl -O http://max.io/mighty-linux.tar.gz
tar -zxf mighty-linux.tar.gz
mighty/mighty-cluster

That's it! `mighty-cluster` will start as many Mighty processes as there are available cores, and automatically configure and start Nginx to listen on port 80. Importantly, `mighty-cluster` accepts the same arguments as `mighty`, so you can load your specific model and pipeline from the command line.
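
For example, to serve a specific sentence-transformers model across all cores (reusing the model named earlier in this guide):

mighty/mighty-cluster --sentence-transformers --model sentence-transformers/msmarco-distilbert-dot-v5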

If you need to stop Mighty for any reason (perhaps to load a new or different model), just use the same `pkill mighty` command to stop the cluster, then restart with `mighty-cluster`. It should go without saying that while the cluster is stopped it cannot serve inference requests, so take precautions when reloading a model. Each core takes only a fraction of a second to start, so the cluster should be back up very quickly.

Mighty is completely horizontally scalable: you can add as many servers as you need to handle the request rate you need. Launch as many machine instances as required, each running mighty-cluster, and load-balance the instances accordingly in your production environment. For example, if a single instance can serve 10,000 queries per second (qps), adding a second instance will allow you to serve 20,000 qps. In AWS you can easily spin up 4 instances in a VPC, start mighty-cluster on each of them, and add an AWS Elastic Load Balancer. This can all be effectively scripted and tooled by experienced DevOps professionals. The more hardware running Mighty, the more requests per second it can serve.
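
If you run your own load balancer rather than a managed one, a minimal Nginx sketch looks like the following (the IP addresses are placeholders for your instances' private addresses):

upstream mighty-instances {
  server 10.0.0.11;
  server 10.0.0.12;
  server 10.0.0.13;
  server 10.0.0.14;
}

server {
  listen 80;
  location / {
    proxy_pass http://mighty-instances;
  }
}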

Using GPUs

Mighty can be run with a GPU, but the pattern is slightly different from the above, and uses a specialized `mighty-gpu` executable. To maximize the utility of an available CUDA device (only CUDA 10.x+ is supported), mighty-gpu should only be started once and is not clustered.

Here's how to deploy on a CUDA-capable Ubuntu server instance:

#Install server dependencies
sudo apt-get install curl nginx hwloc libhwloc-dev

#Download and start Mighty on the GPU
curl -O http://max.io/mighty-linux.tar.gz
tar -zxf mighty-linux.tar.gz
mighty/mighty-gpu
(You will need to install CUDA yourself - this is out of scope for this documentation, but AWS has several AMIs available that come with CUDA pre-installed for compatible EC2 instance types.)

Should you use a GPU?

While GPUs will reduce latency, they are expensive. If the model's latency is acceptable on CPU, then it is typically cheaper to run mighty-cluster. The good news is that both are supported, so if you are considering a GPU I encourage you to measure throughput and latency of both for your model in your own environment, and make a cost-benefit decision.
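
One rough way to compare is to run the same request load against a CPU cluster and a GPU instance (a sketch assuming ApacheBench from the `apache2-utils` package; any load-testing tool will do):

#Send 1,000 requests with 8 concurrent clients and compare latency and throughput
ab -n 1000 -c 8 'http://localhost:5050/?text=Hello+Mighty.'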

Hosting multiple models

Since you can start Mighty on different cores and ports, you can host more than one model on a system if you have available cores. For example, on a system with 4 cores you can start mighty with 2 cores for sentence-transformers and 2 cores for question-answering:

./mighty --sentence-transformers --core 0 --port 5050
./mighty --sentence-transformers --core 1 --port 5051
./mighty --question-answering --core 2 --port 5052
./mighty --question-answering --core 3 --port 5053

With that configuration you can load-balance in nginx with the following configuration:

upstream mighty-sentence-transformers {
  server 127.0.0.1:5050;
  server 127.0.0.1:5051;
}

upstream mighty-question-answering {
  server 127.0.0.1:5052;
  server 127.0.0.1:5053;
}

server {
  listen 80;
  location /sentence-transformers {
    proxy_pass http://mighty-sentence-transformers/sentence-transformers;
  }
  location /question-answering {
    proxy_pass http://mighty-question-answering/question-answering;
  }
}

Setting up your own model repository

Coming Soon!

API

Command Line executable manual

Run `./mighty -h` to show help from the command line:

Optional arguments:
  -h,--help             Show this help message and exit
  -v,--verbose          Verbose will show the configuration at startup
  -h,--host HOST        The ip address on which to listen (default 127.0.0.1)
  -c,--core CORE        The core on which to bind (only available on Linux)
  -p,--port PORT        The http port number to use
  --embeddings          Enable embeddings (Default)
  --question-answering  Enable question-answering
  --sentence-transformers
                        Enable sentence-transformers
  --sequence-classification
                        Enable sequence-classification
  --token-classification
                        Enable token-classification
  -q,--quantized        Set this to try and use a quantized model version
                        (limited availability due to accuracy reduction)
  -a,--always-download  Set this to force the download of a model, even if it
                        exists on disk.
  -d,--only-download    Set this to only download a model and then quit.
  -m,--model MODEL      Location of the model, as a short path, url, or
                        directory. This location must contain config.json,
                        tokenizer.json, and model-[type].onnx
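
The flags can be combined. For example (a sketch using only the options listed above), you could download a model ahead of time and later serve it on a specific core and port:

#Download the model and quit
./mighty --only-download --sentence-transformers --model sentence-transformers/msmarco-distilbert-dot-v5

#Serve it later on core 1, port 5051, with verbose startup output
./mighty --sentence-transformers --model sentence-transformers/msmarco-distilbert-dot-v5 --core 1 --port 5051 --verbose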

REST Endpoints

All endpoints are available as both GET and POST requests. When using POST, set the header "Content-Type: application/json" and provide the values in a JSON body using the same names as the querystring parameters.
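
For example, the embeddings request from the quick start could be sent as a POST (a sketch of the equivalent JSON body):

curl -X POST http://localhost:5050/ \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello Mighty."}'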

/embeddings

/?text={text}

/embeddings/?text={text}

Returns

  • output (Array[Array[float32]]): the output embeddings
  • took (integer): the inference time in milliseconds (not including web request overhead)
  • text (string): the text repeated verbatim from the request
  • shape (integer,integer): the shape of the output ([M,N] where M is the number of tokens and N is the embedding dimension)

/question-answering

/question-answering?question={question}&context={context}

Returns

  • answer (string): the inferred answer text
  • took (integer): the inference time in milliseconds (not including web request overhead)
  • question (string): the question repeated verbatim from the request
  • context (string): the context repeated verbatim from the request
  • start_idx (integer): the context token offset of the start of the answer text
  • end_idx (integer): the context token offset of the end of the answer text

/sentence-transformers

/sentence-transformers?text={text}

Returns

  • took (integer): the inference time in milliseconds (not including web request overhead)
  • text (string): the text repeated verbatim from the request
  • output (Array[Array[float32]]): the output embeddings
  • shape (integer,integer): the shape of the output ([S,N] where S is the number of sentences and N is the embedding dimension)

/sequence-classification

/sequence-classification?text={text}

Returns

  • took (integer): the inference time in milliseconds (not including web request overhead)
  • text (string): the text repeated verbatim from the request
  • logits (Array[float32]): the output probabilities
  • shape (integer,integer): the shape of the logits ([1,N] where N is the number of labels)

/token-classification

/token-classification?text={text}

Returns

  • took (integer): the inference time in milliseconds (not including web request overhead)
  • text (string): the text repeated verbatim from the request
  • entities (Array[entity]): the named entities extracted, where "entity" is:
    • id (string): the id of the type of the named entity (from config.json)
    • label (string): the label of the type of named entity (from config.json)
    • text (string): the text of the extracted named entity
    • score (float32): the score of the entity being correctly identified
    • offsets (integer,integer): the utf-8 character start and end offsets in the original text
  • shape (integer,integer): the shape of the logits ([1,N] where N is the number of labels)

/metadata

Returns the Mighty Inference Server configuration and model metadata.

/healthcheck

Returns a 200 (Success) HTTP response if the server is OK, or a different status code if not.
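
For example, a quick way to check the status from the command line (a minimal sketch):

#Expect an HTTP 200 status line when the server is healthy
curl -i http://localhost:5050/healthcheck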

HTTP Error Messages

  • 409 (Conflict): this message is returned when the incorrect pipeline is used with a model (for example, trying to request `/question-answering` for a `sequence-classification` model configuration)
  • 413 (Payload Too Large): this message is returned when you try to send too much data at once for an inference request. This is a safety measure meant to prevent a server crash if too much memory would be needed to consume the text. See the `/metadata` endpoint to find out the maximum size for the currently running model in the `max_embedding_size` response value.
Note: all other errors are served by Nginx or the system. If Mighty is not running correctly but Nginx is still serving requests, Nginx will return the errors.