mscharhag, Programming and Stuff;

A blog about programming and software development topics, mostly focused on Java technologies including Java EE, Spring and Grails.

  • Monday, 21 September, 2020

    Command-line JSON processing with jq

    In this post we will learn how to parse, pretty-print and process JSON from the command line with jq. At the end we will even use jq to do a simple JSON to CSV conversion. jq describes itself as a lightweight and flexible command-line JSON processor. You can think of it as sed, grep and awk, but for JSON.

    jq works on various platforms. Prebuilt binaries are available for Linux, Windows and macOS. See the jq download site for instructions.

    For many of the following examples we will use a file named artist.json with the following JSON content:

    {
        "name": "Leonardo da Vinci",
        "artworks": [{
                "name": "Mona Lisa",
                "type": "Painting"
            }, {
                "name": "The Last Supper",
                "type": "Fresco"
            }
        ]
    }

    Pretty-printing JSON and basic jq usage

    jq is typically invoked by piping a piece of JSON to its standard input. For example:

    echo '{ "foo" : "bar" }' | jq
    {
      "foo": "bar"
    }

    Without any arguments jq simply outputs the JSON input data. Note that the output has been reformatted: jq pretty-prints JSON by default. This lets us pipe minified JSON to jq and get nicely formatted output.

    jq accepts one or more filters as parameters. The simplest filter is . which returns the whole JSON document. So this example produces the same output as the previous one:

    echo '{ "foo" : "bar" }' | jq '.'

    We can now add a simple object identifier to the filter. For this we will use the previously mentioned artist.json file. With .name we select the value of the name element:

    cat artist.json | jq '.name'
    "Leonardo da Vinci"
    

    Arrays can be navigated using the [] syntax:

    cat artist.json | jq '.artworks[0]'
    {
      "name": "Mona Lisa",
      "type": "Painting"
    }

    To get the name of the first painting we use:

    cat artist.json | jq '.artworks[0].name'
    "Mona Lisa"

    If we want to get the names of all artworks we simply skip the array index parameter:

    cat artist.json | jq '.artworks[].name'
    "Mona Lisa"
    "The Last Supper"

    Processing curl and wget responses

    Of course we can also pipe responses from remote systems to jq. This is not a specific feature of jq, but because it is a common use case we will look at two short examples. For these examples we will use the public GitHub API to get information about my blog-examples repository.

    With curl this is very simple. This extracts the name and full_name properties from the GitHub API response:

    curl https://api.github.com/repos/mscharhag/blog-examples | jq '.name,.full_name'
    "blog-examples"
    "mscharhag/blog-examples"

    Note that we used a comma here to separate two different filters.

    With wget we need to add a few parameters to get the output in the right format:

    wget -cq https://api.github.com/repos/mscharhag/blog-examples -O - | jq '.owner.html_url'
    "https://github.com/mscharhag"
    

    Pipes, functions and operators

    In this section we will look into more ways of filtering JSON data.

    With the | operator we can combine two filters. It works similarly to the standard unix shell pipe: the output of the filter on the left is passed to the one on the right.

    Note that .foo.bar is the same as .foo | .bar (the JSON element .foo is passed to the second filter which then selects .bar).
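
    We can verify this equivalence with a quick inline example:

```shell
# Both filters select the nested "bar" value
echo '{"foo": {"bar": 42}}' | jq '.foo.bar'    # 42
echo '{"foo": {"bar": 42}}' | jq '.foo | .bar' # 42
```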

    Pipes can be combined with functions. For example, we can use the keys function to get the keys of a JSON object:

    cat artist.json | jq '. | keys'
    [
      "artworks",
      "name"
    ]

    With the length function we can get the number of elements in an array:

    cat artist.json | jq '.artworks | length'
    2

    The output of the length function depends on the input element:

    • If a string is passed, then it returns the number of characters
    • For arrays the number of elements is returned
    • For objects the number of key-value pairs is returned
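
    A quick check of these three cases with inline input:

```shell
echo '"Mona Lisa"' | jq 'length'       # 9  (characters)
echo '[1, 2, 3]' | jq 'length'         # 3  (array elements)
echo '{"a": 1, "b": 2}' | jq 'length'  # 2  (key-value pairs)
```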

    We can combine the length function with comparison operators:

    cat artist.json | jq '.artworks | length < 5'
    true

    Assume we want only the artworks whose type is Painting. We can accomplish this using the select function:

    cat artist.json | jq '.artworks[] | select(.type == "Painting")'
    {
      "name": "Mona Lisa",
      "type": "Painting"
    }

    select accepts an expression and returns only those inputs that match the expression.

    Transforming JSON documents

    In this section we will transform the input JSON document into a completely different format.

    We start with this:

    cat artist.json | jq '{(.name): "foo"}'
    {
      "Leonardo da Vinci": "foo"
    }

    Here we create a new JSON object which uses the .name element as key. To use an expression as an object key we need to add parentheses around the key (this does not apply to values, as we will see in the next example).

    Now let's add the list of artworks as value:   

    cat artist.json | jq '{(.name): .artworks}'
    {
      "Leonardo da Vinci": [
        {
          "name": "Mona Lisa",
          "type": "Painting"
        },
        {
          "name": "The Last Supper",
          "type": "Fresco"
        }
      ]
    }

    Next we apply the map function to the artworks array:

    cat artist.json | jq '{(.name): (.artworks | map(.name) )}'
    {
      "Leonardo da Vinci": [
        "Mona Lisa",
        "The Last Supper"
      ]
    }

    map allows us to modify each array element with an expression. Here, we simply select the name value of each array element.

    Using the join function we can join the array elements into a single string:

    cat artist.json | jq '{(.name): (.artworks | map(.name) | join(", "))}'
    {
      "Leonardo da Vinci": "Mona Lisa, The Last Supper"
    }

    The resulting JSON document now contains only the artist and a comma-separated list of his artworks.

    Converting JSON to CSV

    We can also use jq to perform simple JSON to CSV transformations. As an example we will transform the artworks array of our artist.json file to CSV.

    We start with adding the .artworks[] filter:

    cat artist.json | jq '.artworks[]'
    {
      "name": "Mona Lisa",
      "type": "Painting"
    }
    {
      "name": "The Last Supper",
      "type": "Fresco"
    }

    This deconstructs the artworks array into separate JSON objects.

    Note: If we used .artworks (without []) we would get an array containing both elements. By adding [] we get two separate JSON objects that we can process individually.

    Next we convert these JSON objects to arrays. For this we pipe the JSON objects into a new filter:

    cat artist.json | jq '.artworks[] | [.name, .type]'
    [
      "Mona Lisa",
      "Painting"
    ]
    [
      "The Last Supper",
      "Fresco"
    ]

    The new filter returns a JSON array containing two elements (selected by .name and .type).

    Now we can apply the @csv operator, which formats a JSON array as a CSV row:

    cat artist.json | jq '.artworks[] | [.name, .type] | @csv'
    "\"Mona Lisa\",\"Painting\""
    "\"The Last Supper\",\"Fresco\""

    jq applies JSON encoding to its output by default. Therefore, we now see two JSON-escaped strings instead of plain CSV rows, which is not that useful.

    To get the raw CSV output we need to add the -r parameter:

    cat artist.json | jq -r '.artworks[] | [.name, .type] | @csv'
    "Mona Lisa","Painting"
    "The Last Supper","Fresco"
    

    Summary

    jq is a powerful tool for command-line JSON processing. Simple tasks like pretty-printing or extracting a specific value from a JSON document are quickly done in a shell with jq. Furthermore the powerful filter syntax combined with pipes, functions and operators allows us to do more complex operations. We can transform input documents to completely different output documents and even convert JSON to CSV.

    If you want to learn more about jq you should look at its excellent documentation.

     

  • Tuesday, 15 September, 2020

    Implementing the Proxy Pattern in Java

    The Proxy Pattern

    Proxy is a common software design pattern. Wikipedia does a good job describing it like this:

    [..] In short, a proxy is a wrapper or agent object that is being called by the client to access the real serving object behind the scenes. Use of the proxy can simply be forwarding to the real object, or can provide additional logic. [..]

    (Wikipedia)

    UML class diagram:

    proxy pattern

    A client requires a Subject (typically an interface). This subject is implemented by a real implementation (here: RealSubject). A proxy implements the same interface and delegates operations to the real subject while adding its own functionality.

    In the next sections we will see how this pattern can be implemented in Java.

    Creating a simple proxy

    We start with an interface UserProvider (the Subject in the above diagram):

    public interface UserProvider {
        User getUser(int id);
    }

    This interface is implemented by UserProviderImpl (the real implementation):

    public class UserProviderImpl implements UserProvider {
        @Override
        public User getUser(int id) {
            return ...
        }
    }

    UserProvider is used by UsefulService (the client):

    public class UsefulService {
        private final UserProvider userProvider;
    
        public UsefulService(UserProvider userProvider) {
            this.userProvider = userProvider;
        }
        
        // useful methods
    }

    To initialize a UsefulService instance we just have to pass a UserProvider object to the constructor:

    UserProvider userProvider = new DatabaseUserProvider();
    UsefulService service = new UsefulService(userProvider);
    
    // use service

    Now let's add a Proxy object for UserProvider that does some simple logging:

    public class LoggingUserProviderProxy implements UserProvider {
        private final UserProvider userProvider;
    
        public LoggingUserProviderProxy(UserProvider userProvider) {
            this.userProvider = userProvider;
        }
    
        @Override
        public User getUser(int id) {
            System.out.println("Retrieving user with id " + id);
            return userProvider.getUser(id);
        }
    }

    We want to create a proxy for UserProvider, so our proxy needs to implement UserProvider. Within the constructor we accept the real UserProvider implementation. In the getUser(..) method we first write a message to standard out before we delegate the method call to the real implementation.

    To use our Proxy we have to update our initialization code:

    UserProvider userProvider = new UserProviderImpl();
    LoggingUserProviderProxy loggingProxy = new LoggingUserProviderProxy(userProvider);
    UsefulService usefulService = new UsefulService(loggingProxy);
    
    // use service

    Now, whenever UsefulService uses the getUser() method we will see a console message before a User object is returned from UserProviderImpl. With the Proxy pattern we were able to add logging without modifying the client (UsefulService) and the real implementation (UserProviderImpl).

    The problem with manual proxy creation

    The previous solution has a major downside: our proxy implementation is bound to the UserProvider interface and is therefore hard to reuse.

    Proxy logic is often quite generic. Typical use-cases for proxies include caching, access to remote objects or lazy loading.

    However, a proxy needs to implement a specific interface (and its methods). This gets in the way of reusability.

    Solution: JDK Dynamic Proxies

    The JDK provides a standard solution to this problem, called Dynamic Proxies. Dynamic Proxies let us create an implementation of a specific interface at runtime. Method calls on this generated proxy are delegated to an InvocationHandler.

    With Dynamic Proxies the proxy creation looks like this:

    UserProvider userProvider = new DatabaseUserProvider();
    UserProvider proxy = (UserProvider) Proxy.newProxyInstance(
            UserProvider.class.getClassLoader(),
            new Class[]{ UserProvider.class },
            new LoggingInvocationHandler(userProvider)
    );
    UsefulService usefulService = new UsefulService(proxy);

    With Proxy.newProxyInstance(..) we create a new proxy object. This method takes three arguments:

    • The classloader that should be used
    • A list of interfaces that the proxy should implement (here UserProvider)
    • An InvocationHandler implementation

    InvocationHandler is an interface with a single method: invoke(..). This method is called whenever a method on the proxy object is called.

    Our simple LoggingInvocationHandler looks like this:

    public class LoggingInvocationHandler implements InvocationHandler {
    
        private final Object invocationTarget;
    
        public LoggingInvocationHandler(Object invocationTarget) {
            this.invocationTarget = invocationTarget;
        }
    
        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            System.out.println(String.format("Calling method %s with args: %s",
                    method.getName(), Arrays.toString(args)));
            return method.invoke(invocationTarget, args);
        }
    }

    The invoke(..) method has three parameters:

    • The proxy object on which a method has been called
    • The method that has been called
    • A list of arguments that have been passed to the called method

    We first log the method name and the arguments to stdout. Next we delegate the method call to the object that was passed in the constructor (note that we passed the real implementation in the previous snippet).

    The separation of proxy creation (and interface implementation) from proxy logic (via InvocationHandler) supports reusability. Note that we do not have any dependency on the UserProvider interface in our InvocationHandler implementation. In the constructor we accept a generic Object. This gives us the option to reuse the InvocationHandler implementation for different interfaces.

    Limitations of Dynamic Proxies

    Dynamic Proxies always require an interface. We cannot create proxies based on (abstract) classes.

    If this is really an issue for you, you can look into the bytecode manipulation library cglib. cglib is able to create proxies via subclassing and can therefore create proxies for classes without requiring an interface.

    Conclusion

    The Proxy Pattern can be quite powerful. It allows us to add functionality without modifying the real implementation or the client.

    Proxies are often used to add some generic functionality to existing classes. Examples include caching, access to remote objects, transaction management or lazy loading.

    With Dynamic Proxies we can separate proxy creation from proxy implementation. Proxy method calls are delegated to an InvocationHandler which can be re-used.

    Note that in some situations the Proxy Pattern can be quite similar to the Decorator pattern (see this Stackoverflow discussion).

     

  • Wednesday, 9 September, 2020

    Quick tip: Referencing other Properties in Spring

    In Spring property (or yaml) files we can reference other properties using the ${..} syntax.

    For example:

    external.host=https://api.external.com
    external.productService=${external.host}/product-service
    external.orderService=${external.host}/order-service
    

    If we now access the external.productService property (e.g. by using the @Value annotation) we will get the value https://api.external.com/product-service.

    For example:

    @Value("${external.productService}")
    private String productServiceUrl; // https://api.external.com/product-service
    

    This way we can avoid duplication of commonly used values in property and yaml files.

  • Wednesday, 2 September, 2020

    REST: Dealing with Pagination

    In a previous post we learned how to retrieve resource collections. When those collections become larger, it is often useful to provide a way for clients to retrieve partial collections.

    Assume we provide a REST API for painting data. Our database might contain thousands of paintings. However, a web interface showing these paintings to users might only be able to show ten paintings at a time. To view the next paintings the user needs to navigate to the next page, which shows the following ten paintings. This process of dividing the content into smaller consumable sections (pages) is called Pagination.

    Pagination can be an essential part of your API if you are dealing with large collections.

    In the following sections we will look at different types of pagination.

    Using page and size parameters

    The page parameter tells which page should be returned while size indicates how many elements a page should contain.

    For example, this might return the first page, containing 10 painting resources.

    GET /paintings?page=1&size=10

    To get the next page we simply increase the page parameter by one.

    Unfortunately it is not always clear if pages start counting with 0 or 1, so make sure to document this properly.

    (In my opinion 1 should be preferred because this represents the natural page counting)

    A minor issue with this approach is that the client cannot change the size parameter between pages without skipping or re-reading items.

    For example, after getting the first 10 items of a collection by issuing

    GET /paintings?page=1&size=10

    we cannot get the second page with a size of 15 by requesting:

    GET /paintings?page=2&size=15

    This will return items 15-29 of the collection (counting items from 0). So, we missed 5 items (10-14).
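
    The skipped range follows from how the item offset of a page is computed: page p with size s starts at item (p - 1) * s, counting items from 0. A quick shell check:

```shell
# Page 2 with size 15 starts at item 15, but page 1 with
# size 10 ended at item 9 -> items 10-14 are never returned
page=2; size=15
start=$(( (page - 1) * size ))
end=$(( start + size - 1 ))
echo "items $start-$end"   # items 15-29
```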

    Using offset and limit parameters

    Another, but very similar approach is the use of offset and limit parameters. offset tells the server the number of items that should be skipped, while limit indicates the number of items to be returned.

    For example, this might return the first 10 painting resources:

    GET /paintings?offset=0&limit=10

    An offset parameter of 0 means that no elements should be skipped.

    We can get the following 10 resources by skipping the first 10 resources (= setting the offset to 10):

    GET /paintings?offset=10&limit=10

    This approach is a bit more flexible because offset and limit do not affect each other. So we can increase the limit for a specific page. We just need to make sure to adjust the offset parameter for the next page request accordingly.

    For example, this can be useful if a client displays data in an infinitely scrollable list. If the user scrolls faster, the client might request a larger chunk of resources with the next request.

    The downsides?

    Both previous solutions can work fine. They are often very easy to implement. However, both share two downsides.

    Depending on the underlying database and data structure you might run into performance problems for large offsets / page numbers. This is often an issue for relational databases (see this Stackoverflow question for MySQL or this one for Postgres).

    Another problem is resource skipping caused by delete operations. Assume we request the first page by issuing:

    GET /paintings?page=1&size=10

    After we retrieved the response, someone deletes a resource that is located on the first page. Now we request the second page with:

    GET /paintings?page=2&size=10

    We now skipped one resource. Due to the deletion of a resource on the first page, all other resources in the collection move one position forward. The first resource of page two has moved to page one. 

    Seek Pagination

    An approach to solve those downsides is called Seek Pagination. Here, we use resource identifiers to indicate the collection offset.

    For example, this might return the first five resources:

    GET /paintings?limit=5

    Response:

    [
        { "id" : 2, ... },
        { "id" : 3, ... },
        { "id" : 5, ... },
        { "id" : 8, ... },
        { "id" : 9, ... }
    ]

    To get the next five resources, we pass the id of the last resource we received:

    GET /paintings?last_id=9&limit=5

    Response:

    [
        { "id" : 10, ... },
        { "id" : 11, ... },
        { "id" : 13, ... },
        { "id" : 14, ... },
        { "id" : 17, ... }
    ]

    This way we can make sure we do not accidentally skip a resource.

    For a relational database this is now much simpler. It is very likely that we just have to compare the primary key to the last_id parameter. The resulting query probably looks similar to this:

    select * from painting where id > last_id order by id limit 5;

    Response format

    When using JSON, partial results should be returned as a JSON object (instead of a JSON array). Besides the collection items, the total number of items should be included.

    Example response:

    {
        "total": 4321,
        "items": [
            {
                "id": 1,
                "name": "Mona Lisa",
                "artist": "Leonardo da Vinci"
            }, {
                "id": 2
                "name": "The Starry Night",
                "artist": "Vincent van Gogh"
            }
        ]
    }
    

    When using page and size parameters it is also a good idea to return the total number of available pages.

    Hypermedia controls

    If you are using Hypermedia controls in your API you should also add links for the first, last, next and previous pages. This helps decouple the client from your pagination logic.

    For example:

    GET /paintings?offset=0&limit=10
    {
        "total": 4317,
        "items": [
            {
                "id": 1,
                "name": "Mona Lisa",
                "artist": "Leonardo da Vinci"
            }, {
                "id": 2
                "name": "The Starry Night",
                "artist": "Vincent van Gogh"
            },
            ...
        ],
        "links": [
            { "rel": "self", "href": "/paintings?offset=0&limit=10" },
            { "rel": "next", "href": "/paintings?offset=10&limit=10" },
            { "rel": "last", "href": "/paintings?offset=4310&limit=10" },
            { "rel": "by-offset", "href": "/paintings?offset={offset}&limit=10" }
        ]
    }

    Note that we requested the first page. Therefore the first and previous links are missing. The by-offset link uses a URI template, so the client can choose an arbitrary offset.

    Range headers and HTTP status 206 (partial content)

    So far we passed pagination options as request parameters. However, we can also follow an alternative approach using Range and Content-Range headers.

    In the next example request the client uses the Range header to request the first 10 paintings:

    GET /paintings
    Range: items=0-9

    The Range header is used to request only specific parts of the resource and requires the following format:

    Range: <unit>=<range-start>-<range-end>

    With:

    • <unit> - The unit in which the range is specified. Often bytes is used. However, for APIs we can also use something like items.
    • <range-start> - Start of the requested range
    • <range-end> - End of the requested range

    The server responds to this request with HTTP status 206 (Partial Content) which requires a Content-Range header:

    HTTP/1.1 206 Partial Content
    Content-Range: items 0-9/34
    
    [
    	... first 10 items
    ]

    Within the Content-Range header the server communicates the range of the returned items and the total number of items. The required format of Content-Range looks like this:

    Content-Range: <unit> <range-start>-<range-end>/<size>

    With:

    • <unit> - The unit of the returned range
    • <range-start> - Beginning of the range
    • <range-end> - End of the range
    • <size> - Total number of items in the collection. Can be * if the size is not known
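
    As an illustration of this format, a Content-Range value like items 0-9/34 can be split into its parts with plain shell parameter expansion (just a sketch, not something you would normally do by hand):

```shell
header="items 0-9/34"
unit=${header%% *}   # everything before the first space -> "items"
rest=${header#* }    # "0-9/34"
range=${rest%/*}     # "0-9"
size=${rest##*/}     # "34"
echo "$unit $range $size"   # items 0-9 34
```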

    While this approach can work fine, it is usually easier to work with query parameters than to parse Range and Content-Range headers. It is also not possible to provide hypermedia pagination links if we communicate pagination offsets within headers.

     

    Interested in more REST related articles? Have a look at my REST API design page.

  • Thursday, 27 August, 2020

    REST: Retrieving resources

    Retrieving resources is probably the simplest REST API operation. It is implemented by sending a GET request to an appropriate resource URI. Note that GET is a safe HTTP method, so a GET request is not allowed to change resource state. The response format is determined by Content-Negotiation.

    Retrieving collection resources

    Collections are retrieved by sending a GET request to a resource collection.

    For example, a GET request to /paintings might return a collection of painting resources:

    Request:

    GET /paintings
    Accept: application/json
    

    Response:

    HTTP/1.1 200 (Ok)
    Content-Type: application/json
    
    [
        {
            "id": 1,
            "name": "Mona Lisa"
        }, {
            "id": 2
            "name": "The Starry Night"
        }
    ]
    

    The server indicates a successful response using the HTTP 200 status code (see: Common HTTP status codes).

    Note that it can be a good idea to use a JSON object instead of an array as the root element. This allows additional collection information and Hypermedia links besides the actual collection items.

    Example response:

    HTTP/1.1 200 (Ok)
    Content-Type: application/json
    
    {
        "total": 2,
        "lastUpdated": "2020-01-15T10:30:00",
        "items": [
            {
                "id": 1,
                "name": "Mona Lisa"
            }, {
                "id": 2
                "name": "The Starry Night"
            }
        ],
        "_links": [
            { "rel": "self", "href": "/paintings" }
        ]
    }
    

    If the collection is empty the server should respond with HTTP 200 and an empty collection (instead of returning an error).

    For example:

    HTTP/1.1 200 (Ok)
    Content-Type: application/json
    
    {
        "total": 0,
        "lastUpdated": "2020-01-15T10:30:00",
        "items": [],
        "_links": [
            { "rel": "self", "href": "/paintings" }
        ]
    }
    

    Resource collections are often top level resources without an id (like /products or /paintings) but can also be sub-resources. For example, /artists/42/paintings might represent the collection of painting resources for the artist with id 42.

    Retrieving single resources

    Single resources are retrieved in the same way as collections. If the resource is part of a collection it is typically identified by the collection URI plus the resource id.

    For example, a GET request to /paintings/1 might return the painting with id 1:

    Request:

    GET /paintings/1
    Accept: application/json
    

    Response:

    HTTP/1.1 200 (Ok)
    Content-Type: application/json
    Last-Modified: Tue, 16 Feb 2021 12:34:56 GMT
    
    {
        "id": 1,
        "name": "Mona Lisa",
        "artist": "Leonardo da Vinci"
    }
    

    If no resource for the given id is available, HTTP 404 (Not found) should be returned.

    The Last-Modified header

    The previous example also includes a Last-Modified header, which tells us when the resource was last modified.

    This gives us the option to perform conditional requests using the If-Modified-Since and If-Unmodified-Since headers. The If-Modified-Since header helps to support caching, while If-Unmodified-Since can be used to avoid the lost-update problem with concurrent resource updates.

    According to RFC 7232 (conditional requests):

    An origin server SHOULD send Last-Modified for any selected representation for which a last modification date can be reasonably and consistently determined, since its use in conditional requests and evaluating cache freshness (RFC7234) results in a substantial reduction of HTTP traffic on the Internet and can be a significant factor in improving service scalability and reliability.

    Interested in more REST related articles? Have a look at my REST API design page.