[jpa-spec users] [jsr338-experts] Re: support for multitenancy

From: Deepak Anupalli <deepak_at_pramati.com>
Date: Tue, 03 Apr 2012 11:40:09 +0530

On 03-04-2012 02:52, Linda DeMichiel wrote:
> Hi Deepak,
>
> Thanks for the feedback. More below.....
>
> On 4/2/2012 6:14 AM, Deepak Anupalli wrote:
>> Linda,
>>
>> Overall the proposal looks fine. However I was expecting an update to
>> JPA from the SaaS standpoint as well ("Application
>> managed SaaS" in your terminology :)), providing more flexibility to
>> be able to work with the prevailing database
>> partitioning/sharding approaches.
>>
>> (comments inline)
>>
>> On 27-03-2012 05:24, Linda DeMichiel wrote:
>>> One of the main items on the agenda for the JPA 2.1 release is support
>>> for multitenancy in Java EE 7 cloud environments.
>>>
>>> In Java EE 7, an application can be submitted into a cloud environment
>>> for use by multiple tenants in what can be viewed as a basic form of
>>> software as a service (SaaS). The application is customized and
>>> deployed on a per-tenant basis. At runtime, there is a separate
>>> application instance (or set of instances, e.g., in a clustered
>>> environment) per tenant. The instances used by different tenants are
>>> isolated from one another. The resources used by a tenant's
>>> application may also be isolated from one another, or may be shared.
>>> In general, however, it is assumed that a tenant's data is isolated
>>> from that of other tenants.
>>>
>>> There are three well-known approaches to support for multitenancy at
>>> the database level:
>>>
>>> (1) separate database approach
>>> (2) shared database / separate schema approach
>>> (3) shared schema / shared table approach
>>>
>>>
>>> To get the discussion started, this is a high-level strawman sketch of
>>> how the 3 approaches might be used with JPA in keeping with the Java
>>> EE 7 approach. At the same time, however, we also want to be sure
>>> that what we specify in JPA 2.1 can be extended to encompass a more
>>> general approach to SaaS in the future in which a single application
>>> instance serves multiple tenants and in which multitenancy is managed
>>> by the Java EE environment.
>>>
>>> For further information on how Java EE 7 is approaching PaaS/SaaS, you
>>> might find the documents on the javaee-spec.java.net project useful,
>>> particularly
>>> http://java.net/projects/javaee-spec/downloads/download/PaaS.pdf
>>> and the latest draft of the Java EE 7 Platform spec,
>>> http://java.net/projects/javaee-spec/downloads/download/JavaEE_Platform_Spec.pdf.
>>>
>>>
>>> Note that the identifier for the tenant will be made available to the
>>> application in JNDI as java:comp/tenantId. The tenantId will be a
>>> string, whose max length should allow it to be portably stored in a
>>> single database column.
>>>
>>> APPROACHES:
>>>
>>> (1) Separate database approach
>>>
>>> In this approach, each tenant's persistence unit is mapped to a
>>> separate database. This approach provides the greatest isolation
>>> between tenants and does not impose any additional constraints over
>>> the object/relational mapping of the persistence unit or over the
>>> operations that can be performed. In particular, the use of
>>> multiple database schemas or catalogs is supported, as are native
>>> queries.
>>>
>>> In some cloud environments, use of this approach might not be
>>> available, as a tenant might be allocated storage within a database
>>> rather than a separate database.
>> Most secure and easiest to manage. I see a couple of variants of PaaS
>> providers in the evolving PaaS space:
>> A) PaaS provider who provisions database instances
>> B) PaaS provider who integrates with a third-party DBaaS
>>
>> A Type A PaaS provider could offer options to create a database
>> instance "per application" or "per tenant". Per
>> application gives the best possible isolation.
>>
>
> Not sure I understand what you mean by "per application" here: per
> application instance?
> or per application where multiple application instances are using the
> database?

By "per application" I meant dedicating a separate database instance to
each tenant application. The reason is that one of the tenant
applications could be a beast that takes up a lot of database resources.
Provisioning a separate process/instance isolates not just the data but
also resource consumption.

Moreover, it is not too far-fetched to think of two different tenant
applications using the same schema names, in which case this option also
does the job.

>
>> A Type B PaaS provider integrates with existing cloud DBaaS
>> providers like Amazon RDS or Google Cloud SQL
>> (https://developers.google.com/cloud-sql/). Some of these services
>> already provide multiple database instances per
>> user/account, which could be easily mapped to "per application" or
>> "per tenant" strategies.
>>
>> The biggest challenge for a Java EE PaaS provider is to be able to
>> configure/deploy an existing Java EE application
>> without much re-architecting of code. There are so many applications
>> already out there with hardcoded schema
>> names and database procedures, and the next two approaches fail
>> to get them onto the cloud.
>
> Yes, applications that were not matched to the facilities of the cloud
> platform provider would
> fail to deploy if they were expecting resources that could not be made
> available. That's one
> of the reasons I think we need to provide metadata as to an
> application's expectations as to
> what it needs in its environment. Providing no such metadata could
> potentially default to
> the "I need a separate database" assumption, but that wouldn't be
> great in terms of scalability.
>

I agree that metadata should be specified to provide the database
mapping strategy, but I do not agree about the usefulness of the
"separate schema per application" and "separate table" approaches.

They not only add complexity to the thinking process, but also influence
how the application defines its schema and data. Java EE PaaS should
target a wide range of applications that can be easily ported to the
cloud.

>
>> I'm not in favour of
>> the other two approaches.
>>
>> In Summary, I would only vote for the following:
>> (1) Separate database per application
>> (2) Separate database per tenant
>>
>>>
>>>
>>> (2) Shared database / separate schema approach
>>>
>>> In this approach, each tenant's data is stored in database tables
>>> that are isolated from those of any other tenant. In databases that
>>> support schemas, this will typically be achieved by allocating a
>>> separate schema per tenant. The database's permissions facility is
>>> used to confine a tenant's access to the designated schema, thus
>>> providing isolation between tenants at the schema level.
>>>
>>> Support for this approach is straightforward if the persistence unit
>>> uses only the default schema or catalog (i.e., if it does not specify
>>> schema names or catalogs in the object/relational mapping metadata).
>>> A native query that attempts to access data in a schema other than
>>> that assigned to the tenant by the platform provider will be trapped
>>> by the database authorization mechanisms and will result in an
>>> exception.
>>>
>>> [While the case where the persistence unit metadata explicitly
>>> specifies one or more schemas could potentially be handled by the
>>> persistence provider by remapping schema names and native queries that
>>> embed schema names, I would not propose that we specify or require
>>> support for this case, although a more sophisticated persistence
>>> provider might choose to support it.]
>>>
>>> (3) Shared table approach
>>>
>>> In this approach, database tables are shared ("striped") across
>>> tenants.
>>>
>>> It is the responsibility of the persistence provider to provide
>>> per-tenant isolation in accessing data. This will typically be done
>>> by mapping and maintaining a tenant ID column in the respective
>>> tables, and augmenting data retrieval and query operations, updates,
>>> and inserts with tenant IDs. The use of native queries would need to
>>> be trapped by the persistence provider and not allowed unless the
>>> persistence provider were able to modify them to provide isolation of
>>> tenant data.
>>>
>>> Ideally, the management of the tenant id should be transparent to the
>>> application, although we should revisit this in Java EE 8 as we move
>>> further into support for SaaS.
>>>
>>> I believe that the main use case for the shared table approach is in
>>> SaaS environments in which a single application instance is servicing
>>> multiple tenants. This is outside the scope of Java EE 7, so I don't
>>> think that we need to standardize on support for this approach now,
>>> although we should not lose sight of it as we standardize on other
>>> aspects.
>>>
>>>
>> Too risky, shouldn't be taking this route.
>
> I don't understand. Can you explain further?

Support for the "shared table approach" could be error prone: if the
vendor doesn't implement it carefully, the chances of messing up
application data are high.
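
To make the risk concrete, here is a minimal sketch of what the shared
table approach asks of the provider: every insert is stamped with a
tenant ID and every read carries the tenant ID predicate. TenantRow and
StripedTable are made-up names for illustration, and the "table" is an
in-memory list rather than a real database. A single code path that
skips the filter would expose another tenant's rows.

```java
import java.util.ArrayList;
import java.util.List;

/** One row of a shared ("striped") table; tenantId plays the role of the tenant ID column. */
final class TenantRow {
    final String tenantId;
    final String payload;

    TenantRow(String tenantId, String payload) {
        this.tenantId = tenantId;
        this.payload = payload;
    }
}

/** Hypothetical in-memory stand-in for a table shared by all tenants. */
final class StripedTable {
    private final List<TenantRow> rows = new ArrayList<>();

    /** Inserts are stamped with the tenant ID, as the provider would do transparently. */
    void insert(String tenantId, String payload) {
        rows.add(new TenantRow(tenantId, payload));
    }

    /** Reads are filtered by tenant ID; omitting this check would leak other tenants' data. */
    List<String> findAll(String tenantId) {
        List<String> result = new ArrayList<>();
        for (TenantRow row : rows) {
            if (row.tenantId.equals(tenantId)) {
                result.add(row.payload);
            }
        }
        return result;
    }
}
```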

An application deployer wouldn't appreciate the provider gaining control
of the application schema.

Why would an application deployer choose the "shared table" approach?
Maybe he/she doesn't have much data right now (less isolation, lower
price), but as the application grows he/she may want to move to another
approach that provides better isolation (probably at a higher price).
Vendors would then be bound not only to support these approaches, but
also to implement various migration strategies.

Suppose I deploy an application specifying the database mapping strategy
as "multitenancy = SEPARATE_TABLE" in my persistence.xml, and later I
want to re-deploy the same application with "multitenancy =
SEPARATE_SCHEMA". Can we support this, and how much effort does it take?
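
As a rough illustration of the redeployment question, the strategies can
be modelled as an enum ordered by isolation. The sketch uses the value
names from the proposal (SEPARATE_DATABASE, SHARED_DATABASE,
SHARED_SCHEMA). The RedeploymentCheck helper and its rules are
assumptions made for the sake of the example, not something the proposal
defines.

```java
/** Storage mapping strategies named in the proposal, ordered from most to least isolated. */
enum MultitenancyStrategy {
    SEPARATE_DATABASE,  // one database per tenant
    SHARED_DATABASE,    // shared database, separate schema per tenant
    SHARED_SCHEMA;      // shared tables, striped by a tenant ID column

    /** True if this strategy isolates tenants at least as strongly as the requested one. */
    boolean isAtLeastAsIsolatedAs(MultitenancyStrategy requested) {
        return this.ordinal() <= requested.ordinal();
    }
}

/** Hypothetical helper: what redeploying under a different strategy would involve. */
final class RedeploymentCheck {
    private RedeploymentCheck() {}

    /** Any change of strategy moves tenant data to a different physical layout. */
    static boolean requiresDataMigration(MultitenancyStrategy from, MultitenancyStrategy to) {
        return from != to;
    }

    /** Leaving shared tables additionally means extracting rows by tenant ID. */
    static boolean requiresRowExtraction(MultitenancyStrategy from, MultitenancyStrategy to) {
        return from == MultitenancyStrategy.SHARED_SCHEMA
                && to != MultitenancyStrategy.SHARED_SCHEMA;
    }
}
```

Even in this toy form, moving away from SHARED_SCHEMA is the expensive
case: the tenant's rows have to be picked out of tables that also hold
other tenants' data.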

>
>> Rather, we should leave it entirely to the application
>> developer to
>> manage multi-tenancy, while providing better support through JPA to
>> address database multi-tenancy approaches. I'm not sure
>> if we can standardize these various approaches through a single API,
>> but we can definitely make some progress to be able to
>> catch up with the future.
>>
>> "Application managed SaaS" provides the best degree of control over
>> multi-tenancy, and not being able to support that would
>> definitely be a minus. I guess some of the JPA vendors (Hibernate,
>> EclipseLink et al.) have already introduced support for
>> database multi-tenancy features, and we have experts in this group who
>> have a lot of experience in this area to help in
>> building or standardizing the support for multi-tenancy.
>>
>
> Just to avoid misunderstanding, by "application-managed multitenancy",
> I mean that the
> application itself is managing multitenancy, not necessarily with any
> additional support from the
> persistence provider or platform. I.e., the app will need to manage
> tenant identity,
> on-boarding of additional tenants, tenant-specific configuration
> information, intermediation
> on access to tenant-specific data, etc., etc.

Yes, exactly. See how much effort an application developer has to put in
to achieve this. Why can't we standardize/productize these essential
things through JPA? To get database multi-tenancy support today,
developers are bound to drop JPA and build things from scratch via the
JDBC route. We really do not want developers doing that, do we?
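
For a sense of the plumbing involved, here is a minimal sketch of one
piece an application has to write itself today: a lookup that hands out
one factory per tenant. TenantFactoryRegistry is a hypothetical name,
and the type parameter F stands in for EntityManagerFactory so that the
sketch compiles without a JPA dependency.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Caches one factory per tenant. F stands in for EntityManagerFactory;
 * the creator function is whatever builds a factory for a given tenant ID.
 */
final class TenantFactoryRegistry<F> {
    private final Map<String, F> factories = new ConcurrentHashMap<>();
    private final Function<String, F> creator;

    TenantFactoryRegistry(Function<String, F> creator) {
        this.creator = creator;
    }

    /** Returns the factory for the tenant, creating it on first use. */
    F factoryFor(String tenantId) {
        if (tenantId == null || tenantId.isEmpty()) {
            throw new IllegalArgumentException("tenantId is required");
        }
        return factories.computeIfAbsent(tenantId, creator);
    }

    /** Number of tenants that currently have a factory. */
    int size() {
        return factories.size();
    }
}
```

With JPA on the classpath, the creator could call
Persistence.createEntityManagerFactory(unitName, properties) with
tenant-specific connection properties. The application would still have
to resolve the tenant ID, on-board new tenants, and close the factories
on its own.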

>
>> - Support for separate Read/Write connections to handle database
>> reads & updates separately (For Master/Slave replication)
>> - Support for multi-tenancy at the EMF (mechanism to choose the most
>> appropriate EMF based on tenant/session/other
>> criteria)
>>
>
> This wouldn't be application-managed SaaS in my terminology, but
> rather some point
> intermediate between container-/provider-managed SaaS and
> application-managed SaaS
>
>> Provided there's enough time and the group is inclined
>> towards this, we can definitely brainstorm the
>> possibilities (plus or minus)
>>
>>>
>>> DETERMINING THE MULTITENANCY STORAGE MAPPING STRATEGY:
>>>
>>> We see two general approaches to determining the multitenancy storage
>>> mapping strategy that should be used for a persistence unit. In some
>>> cases, these approaches might be combined.
>>>
>>> Again, note that a cloud platform provider might use a single strategy
>>> for all tenants in allocating database storage. For example, each
>>> tenant might be allocated a separate database, or each tenant might
>>> only be allocated a schema within a database.
>>>
>>>
>>> (A) The Application Specifies Its Requirements
>>>
>>> In this approach, the application specifies its functional
>>> requirements (in terms of need for named, multiple schemas and/or use
>>> of native queries) in the persistence.xml descriptor, and the deployer
>>> and/or cloud platform provider determine the storage strategy that is
>>> used for the tenant. This metadata serves as input to the deployer
>>> for the tenant or as input into the automated provisioning of the
>>> application by the cloud platform provider (if automated provisioning
>>> is supported by the platform instance).
>>>
>>> For example, an application might specify that it requires support for
>>> multiple schemas and native queries. In general, such requirements
>>> would mean that a separate database would need to be provisioned for
>>> the tenant. If this is not possible, then unless the platform
>>> provider supported a persistence provider that could perform schema
>>> remapping and/or modification of native queries, the application might
>>> fail to deploy or fail to initialize. On the other hand, if an
>>> application specifies that it uses only the default schema and native
>>> queries, then either the separate database or separate schema approach
>>> could be used.
>>>
>>>
>>>
>>> (B) The Application Specifies the Multitenancy Storage Mapping Strategy
>>>
>>> An alternative approach is that the application specifies the required
>>> (or preferred) multitenancy storage mapping strategy in the
>>> persistence.xml.
>>>
>>> For example, a multitenant application that is designed with the
>>> intention that separate databases be used might indicate this in the
>>> persistence.xml as multitenancy = SEPARATE_DATABASE.
>>>
>>> An application that is designed with the intention that databases may
>>> be shared by partitioning at the database schema level might indicate
>>> this in the persistence.xml as multitenancy = SHARED_DATABASE. [A
>>> portable application that specifies this strategy should not specify
>>> schema or catalog names, as it might otherwise fail to deploy or fail
>>> to initialize.]
>>>
>>> An application that is designed with the intention that tables be
>>> shared might indicate this in the persistence.xml as multitenancy =
>>> SHARED_SCHEMA. An app that uses explicit multitenant mapping metadata
>>> would be expected to specify this.
>>>
>>> [Open Issue: Is it useful to specify requirements along the lines of
>>> those used in approach (A) with this approach? If so, is the platform
>>> provider allowed to choose a different mapping strategy as long as
>>> that approach is more isolated? If no functional requirements are
>>> specified as in approach (A) and if a mapping strategy is specified in
>>> the persistence.xml that is provided by the application submitter,
>>> then if this information is not observed, the risk is that the app
>>> will fail. For example, observation of the specified mapping strategy
>>> might be required for the case where explicit multitenant mapping
>>> metadata is supplied for the striped mapping approach.]
>>>
>>>
>>> With both the approaches (A) and (B), different storage mapping
>>> strategies may be used for different tenants of the same application
>>> if the cloud platform provider supports a range of storage mapping
>>> choices.
>>>
>>>
>>> REQUIREMENTS FOR PORTABLE APPLICATIONS
>>>
>>> Applications that are intended to be portable in cloud environments
>>> should not specify schema or catalog names.
>>>
>>>
>>> DEPLOYMENT
>>>
>>> When an application instance is deployed for a tenant, the container
>>> needs to make the tenant identifier and tenant-related configuration
>>> information available to the persistence provider. The container
>>> needs to pass to the persistence provider a data source that is
>>> configured with appropriate credentials for the tenant, and which will
>>> provide isolation between that tenant and other tenants of the
>>> application. We should probably also define an interface to capture
>>> the tenant identifier and tenant-related metadata and configuration
>>> information that the container needs to pass to the persistence
>>> provider, e.g., a TenantContext.
>>>
>>>
>>> OTHER OPEN ISSUES
>>>
>>> 1. Additional metadata to support schema generation.
>>>
>>> 2. Do we need metadata to indicate whether an application supports
>>> multitenant use -- i.e., whether it is "multitenant enabled"?
>>> Do we need this information specifically for JPA?
>>>
>>> 3. Specification of resources that are shared across tenants--e.g.,
>>> a persistence unit for reference data that can be accessed by
>>> multiple tenants.
>>>
>>>
>>>
>> -Deepak
>
-Deepak